# Segmenting and Clustering Neighborhoods in Toronto, Canada  
#### A Jupyter notebook for the Applied Data Science course on Coursera, part of IBM's professional data science certificate

![alt text](https://www.thoughtco.com/thmb/29oq-pw_IipfasoDdPyw4L3WpFk=/768x0/filters:no_upscale():max_bytes(150000):strip_icc():format(webp)/national-flag-canada-lge2-56a0e57f5f9b58eba4b4f422.jpg)

## Table of Contents

1. Scraping the Wikipedia page & creating a cleaned up pandas dataframe
2. Merging geographical coordinates with our pandas postal code dataframe
3. Explore and Cluster the neighborhoods of Toronto

## Part 1: Scraping the Wikipedia page that lists the postal codes of Toronto, Canada and creating a cleaned up pandas dataframe

Let's get started by setting up all the dependencies we'll need for this project.

In [26]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
from pandas import DataFrame
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed (e.g. from the Foursquare API lab)
import folium # map rendering library
!conda install -c conda-forge geopy --yes # comment this line out if geopy is already installed (e.g. from the Foursquare API lab)
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import numpy as np

We would like to scrape a table from the Wikipedia page 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'.  

List of postal codes of Canada: M "This is a list of postal codes in Canada where the first letter is M. Postal codes beginning with M are located within the city of Toronto in the province of Ontario. Only the first three characters are listed, corresponding to the Forward Sortation Area."

We will do this in the notebook using [Requests](https://2.python-requests.org/en/master/) and [Beautiful Soup 4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).  
Requests and Beautiful Soup 4 are both Python libraries that simplify the scraping process: Requests downloads the page, and Beautiful Soup parses the html.

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M' #the url for the wikipedia page we'd like to scrape our info from
source = requests.get( url ).text
soup = BeautifulSoup( source, 'lxml' )
print( soup.prettify() )

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: M - Wikipedia
  </title>
  <script>
   document.documentElement.className=document.documentElement.className.replace(/(^|\s)client-nojs(\s|$)/,"$1client-js$2");RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":906439794,"wgRevisionId":906439794,"wgArticleId":539066,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Communications in Ontario","Postal codes in Canada","Toronto","Ontario-related lists"],"wgBreakFrames":!1,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June",

If we scroll through and inspect the html, we can see that the start of our table is indicated by: 'table class="wikitable sortable"'.  
We can use this to parse out the information that we'd like to format in our dataframe.

In [4]:
torontoPostalCode_html = soup.find( 'table', class_='wikitable sortable' )

#formatting the html into a usable list
torontoPostalCode_list = [ ]
#each table cell in the html is marked by a 'td' tag (rows are 'tr'), so let's go cell by cell to pull out the fields of information.
for cell in torontoPostalCode_html.find_all( 'td' ):
    text = cell.text #get just the text (not the other html stuff)
    text = text.replace( '\n', '' ) #while we're at it, let's drop all the '\n'
    torontoPostalCode_list.append( text ) #add this cell's text to our list
    
torontoPostalCode_list

['M1A',
 'Not assigned',
 'Not assigned',
 'M2A',
 'Not assigned',
 'Not assigned',
 'M3A',
 'North York',
 'Parkwoods',
 'M4A',
 'North York',
 'Victoria Village',
 'M5A',
 'Downtown Toronto',
 'Harbourfront',
 'M5A',
 'Downtown Toronto',
 'Regent Park',
 'M6A',
 'North York',
 'Lawrence Heights',
 'M6A',
 'North York',
 'Lawrence Manor',
 'M7A',
 "Queen's Park",
 'Not assigned',
 'M8A',
 'Not assigned',
 'Not assigned',
 'M9A',
 'Etobicoke',
 'Islington Avenue',
 'M1B',
 'Scarborough',
 'Rouge',
 'M1B',
 'Scarborough',
 'Malvern',
 'M2B',
 'Not assigned',
 'Not assigned',
 'M3B',
 'North York',
 'Don Mills North',
 'M4B',
 'East York',
 'Woodbine Gardens',
 'M4B',
 'East York',
 'Parkview Hill',
 'M5B',
 'Downtown Toronto',
 'Ryerson',
 'M5B',
 'Downtown Toronto',
 'Garden District',
 'M6B',
 'North York',
 'Glencairn',
 'M7B',
 'Not assigned',
 'Not assigned',
 'M8B',
 'Not assigned',
 'Not assigned',
 'M9B',
 'Etobicoke',
 'Cloverdale',
 'M9B',
 'Etobicoke',
 'Islington',
 'M9B',
 

That's great! We've just scraped all the text we need for our dataframe.  
But it's still a flat list that isn't organized by columns yet, so that's what we need to do next.

In [5]:
#we know from looking at the wikipedia page that the table has 3 columns
numColumns = 3
#a simple list comprehension can reshape things to better represent the tabular relationship of the data
torontoPostalCode_list = [ torontoPostalCode_list[ i:i+numColumns ] for i in range( 0, len( torontoPostalCode_list ), numColumns ) ]
torontoPostalCode_list

[['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Harbourfront'],
 ['M5A', 'Downtown Toronto', 'Regent Park'],
 ['M6A', 'North York', 'Lawrence Heights'],
 ['M6A', 'North York', 'Lawrence Manor'],
 ['M7A', "Queen's Park", 'Not assigned'],
 ['M8A', 'Not assigned', 'Not assigned'],
 ['M9A', 'Etobicoke', 'Islington Avenue'],
 ['M1B', 'Scarborough', 'Rouge'],
 ['M1B', 'Scarborough', 'Malvern'],
 ['M2B', 'Not assigned', 'Not assigned'],
 ['M3B', 'North York', 'Don Mills North'],
 ['M4B', 'East York', 'Woodbine Gardens'],
 ['M4B', 'East York', 'Parkview Hill'],
 ['M5B', 'Downtown Toronto', 'Ryerson'],
 ['M5B', 'Downtown Toronto', 'Garden District'],
 ['M6B', 'North York', 'Glencairn'],
 ['M7B', 'Not assigned', 'Not assigned'],
 ['M8B', 'Not assigned', 'Not assigned'],
 ['M9B', 'Etobicoke', 'Cloverdale'],
 ['M9B', 'Etobicoke', 'Islington'],
 ['M9B', 
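
The reshaping step above assumes the flat list holds a whole number of three-cell rows. As a minimal sketch (the helper name `chunk_rows` is hypothetical, not part of the notebook), the same slicing can be wrapped with a length check, so that a changed table layout fails loudly instead of silently misaligning the columns:

```python
def chunk_rows(cells, num_columns):
    """Split a flat list of cell texts into rows of num_columns cells each."""
    if len(cells) % num_columns != 0:
        raise ValueError(
            'expected a multiple of {} cells, got {}'.format(num_columns, len(cells)))
    # Same slicing as the list comprehension in the cell above.
    return [cells[i:i + num_columns] for i in range(0, len(cells), num_columns)]


# First six cells from the scraped output shown earlier.
cells = ['M1A', 'Not assigned', 'Not assigned',
         'M2A', 'Not assigned', 'Not assigned']
rows = chunk_rows(cells, 3)
print(rows)
```

With the real data this would be called as `chunk_rows(torontoPostalCode_list, numColumns)` before the list is overwritten.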

Now that our list is formatted in a way that reflects the structure of the Wikipedia html table, let's recast the list as a pandas dataframe.

In [15]:
torontoPostalCode_df = DataFrame.from_records( torontoPostalCode_list )
torontoPostalCode_df.columns = ['Postal Code', 'Borough', 'Neighbourhood']
torontoPostalCode_df.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


This is a great start, but let's clean the data to the standards that the assignment specifies  
  
1) Let's only process the rows that have an assigned borough. So, let's drop all the rows that contain "Not assigned"  Borough values.

In [16]:
torontoPostalCode_df = torontoPostalCode_df[~torontoPostalCode_df.Borough.str.contains("Not assigned")].copy() #.copy() avoids a SettingWithCopyWarning when we edit this dataframe later
torontoPostalCode_df.head( 10 )

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | Not assigned |
| 10 | M9A | Etobicoke | Islington Avenue |
| 11 | M1B | Scarborough | Rouge |
| 12 | M1B | Scarborough | Malvern |


2) If a row has a borough but a "Not assigned" neighborhood, then the neighborhood will be the same as the borough.

In [17]:
mask = torontoPostalCode_df['Neighbourhood'] == 'Not assigned'
torontoPostalCode_df.loc[mask, 'Neighbourhood'] = torontoPostalCode_df.loc[mask, 'Borough']
torontoPostalCode_df.head(10)

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | Queen's Park |
| 10 | M9A | Etobicoke | Islington Avenue |
| 11 | M1B | Scarborough | Rouge |
| 12 | M1B | Scarborough | Malvern |


3) More than one neighborhood can exist in one postal code area. Combine neighborhoods in the same postal code into one line separated by a comma.

In [19]:
#group by postal code and borough, then join the neighbourhood names of each group with ', '
torontoPostalCode_df = torontoPostalCode_df.groupby( [ 'Postal Code', 'Borough' ] )[ 'Neighbourhood' ].agg( ', '.join ).reset_index( )
torontoPostalCode_df.head( 12 )

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [20]:
torontoPostalCode_df.shape

(103, 3)
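
The three cleaning rules above can also be read as one small pipeline. The following is a sketch on four rows taken from the scraped table, not the notebook's own code: the function name `clean_postal_codes` is hypothetical, and it swaps `str.contains` for an exact `!=` comparison, which is an assumption about the intended rule.

```python
import pandas as pd


def clean_postal_codes(raw):
    """Apply the three cleaning rules to a Postal Code / Borough / Neighbourhood table."""
    # Rule 1: keep only rows with an assigned borough.
    df = raw[raw['Borough'] != 'Not assigned'].copy()
    # Rule 2: an unassigned neighbourhood takes the borough's name.
    mask = df['Neighbourhood'] == 'Not assigned'
    df.loc[mask, 'Neighbourhood'] = df.loc[mask, 'Borough']
    # Rule 3: one row per postal code, neighbourhoods joined by ', '.
    return (df.groupby(['Postal Code', 'Borough'])['Neighbourhood']
              .agg(', '.join)
              .reset_index())


sample = pd.DataFrame(
    [['M1A', 'Not assigned', 'Not assigned'],
     ['M5A', 'Downtown Toronto', 'Harbourfront'],
     ['M5A', 'Downtown Toronto', 'Regent Park'],
     ['M7A', "Queen's Park", 'Not assigned']],
    columns=['Postal Code', 'Borough', 'Neighbourhood'])
cleaned = clean_postal_codes(sample)
print(cleaned)
```

On this sample the M1A row is dropped, the two M5A rows collapse into one, and M7A inherits its borough name as its neighbourhood.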

## Part 2: Merging geographical coordinates with our pandas postal code dataframe

Here we create a new pandas dataframe from the provided csv file, which holds the geospatial data for Toronto postal codes, and merge it with the Toronto postal code dataframe we built by scraping Wikipedia. In the following hidden cell, the csv, previously uploaded to the cloud, is read into a pandas dataframe by code auto-generated by the IBM Watson environment.

In [49]:

import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

#@hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share your notebook.
client_673d1eb488454c2aaf3d8511524c61d4 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='<YOUR_IBM_API_KEY>', # credential redacted; supply your own key
    ibm_auth_endpoint="https://iam.ng.bluemix.net/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

body = client_673d1eb488454c2aaf3d8511524c61d4.get_object(Bucket='segmentingampclusteringneighborho-donotdelete-pr-aozesu2wrmnmxf',Key='Geospatial_Coordinates.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

torontoGeospatial_df = pd.read_csv(body)
torontoGeospatial_df.head()



| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


Now we simply merge the two dataframes with the merge method according to the shared values in the 'Postal Code' column.

In [36]:
torontoMerged_df = pd.merge( torontoPostalCode_df, torontoGeospatial_df, on='Postal Code' )
torontoMerged_df.head( 12 )

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
| 5 | M1J | Scarborough | Scarborough Village | 43.744734 | -79.239476 |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park | 43.727929 | -79.262029 |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge | 43.711112 | -79.284577 |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West | 43.716316 | -79.239476 |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West | 43.692657 | -79.264848 |
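
The default inner merge silently drops any postal code that has no match in the coordinates file. A possible check, sketched here on toy frames, is a left merge with `indicator=True`, which flags unmatched rows. The 'M9Z' row and its borough are made up for illustration and are not from the data.

```python
import pandas as pd

postal = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C', 'M9Z'],  # 'M9Z' is a hypothetical unmatched code
    'Borough': ['Scarborough', 'Scarborough', 'Hypothetical'],
})
coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# how='left' keeps every postal code; the added '_merge' column records
# whether each row found a partner in the coordinates frame.
checked = pd.merge(postal, coords, on='Postal Code', how='left', indicator=True)
missing = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(missing)
```

An empty `missing` list on the real dataframes would confirm that every scraped postal code received coordinates.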


## Part 3: Explore and Cluster the neighborhoods of Toronto

Let's start by visualizing the postal codes we have in our dataframe.  
  
First, we will get the coordinates of Toronto, Canada with geopy's Nominatim geocoder, so that we can center our maps on the city.

In [27]:
address = 'Toronto, CA'
geolocator = Nominatim(user_agent="Toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.
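
geopy's `geocode` returns `None` when the service finds no match, and the request itself can fail, so `location.latitude` in the cell above may raise. A defensive wrapper might look like the sketch below. The helper name `geocode_or_default` is hypothetical, and the fallback is simply the pair of coordinates printed above.

```python
# Coordinates printed by the cell above, used here as a fallback.
TORONTO_FALLBACK = (43.653963, -79.387207)


def geocode_or_default(geocode, address, default=TORONTO_FALLBACK):
    """Return (latitude, longitude) for address, or default if the lookup fails."""
    try:
        location = geocode(address)
    except Exception:
        # Any lookup error (timeout, service unavailable, ...) falls back.
        return default
    if location is None:
        # The geocoder answered but found no match.
        return default
    return (location.latitude, location.longitude)


# Stand-in geocoder so the sketch runs without network access.
def no_match(address):
    return None


print(geocode_or_default(no_match, 'Toronto, CA'))
```

In the notebook this would be called as `latitude, longitude = geocode_or_default(geolocator.geocode, address)`.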


Next, we will create a folium map centered on the Toronto coordinates, with a marker for each postal code in our newly merged geospatial Toronto dataframe.

In [28]:
# create map of Toronto using latitude and longitude values
map_Toronto = folium.Map(location=[latitude, longitude], zoom_start=10.5)

# add markers to map
for lat, lng, borough, neighborhood in zip(torontoMerged_df['Latitude'], torontoMerged_df['Longitude'], torontoMerged_df['Borough'], torontoMerged_df['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_Toronto)  
    
map_Toronto

That's great! Next, we will build a more sophisticated map that clusters areas by the similarity of their venues, using information pulled from the Foursquare API.

In [50]:
#@hidden_cell
CLIENT_ID = '<YOUR_FOURSQUARE_CLIENT_ID>' # your Foursquare ID
CLIENT_SECRET = '<YOUR_FOURSQUARE_CLIENT_SECRET>' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

#print('Your credentials:')
#print('CLIENT_ID: ' + CLIENT_ID)
#print('CLIENT_SECRET:' + CLIENT_SECRET)

In [38]:
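#Let's borrow this function from the Exploring Neighborhoods in Manhattan lab
#note: it reads the global LIMIT (max venues per request), which must be defined before the function is called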
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
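
The function above indexes `v['venue']['categories'][0]` directly, so a venue returned with an empty categories list would raise an `IndexError`. The sketch below isolates the flattening step and guards that case. The helper name `venue_rows`, the `'Uncategorized'` label, and the second sample venue are illustrative assumptions, not part of the lab code or the Foursquare API.

```python
def venue_rows(name, lat, lng, items, default_category='Uncategorized'):
    """Flatten explore-style items into tuples, tolerating venues without categories."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else default_category
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows


# Hand-written items that mimic the fields read by getNearbyVenues.
sample_items = [
    {'venue': {'name': "Wendy's",
               'location': {'lat': 43.807448, 'lng': -79.199056},
               'categories': [{'name': 'Fast Food Restaurant'}]}},
    {'venue': {'name': 'Unnamed Spot',
               'location': {'lat': 43.8, 'lng': -79.2},
               'categories': []}},
]
rows = venue_rows('Rouge, Malvern', 43.806686, -79.194353, sample_items)
for row in rows:
    print(row)
```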

Now we run the above function on each neighborhood and create a new dataframe called toronto_venues

In [39]:
LIMIT = 100
toronto_venues = getNearbyVenues(names=torontoMerged_df['Neighbourhood'],
                                   latitudes=torontoMerged_df['Latitude'],
                                   longitudes=torontoMerged_df['Longitude']
                                  )

print(toronto_venues.shape)
toronto_venues.head()

(2255, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Rouge, Malvern",43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,Affordable Toronto Movers,43.787919,-79.162977,Moving Target
3,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Swiss Chalet Rotisserie & Grill,43.767697,-79.189914,Pizza Place
4,"Guildwood, Morningside, West Hill",43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store


Now let's analyze each neighborhood with one-hot encoding.

In [40]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

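# move neighborhood column to the first column
# (selecting it by name is safer than taking the last column: if a venue category is itself called 'Neighborhood', the assignment above overwrites that column in place instead of appending a new one)
fixed_columns = ['Neighborhood'] + [col for col in toronto_onehot.columns if col != 'Neighborhood']
toronto_onehot = toronto_onehot[fixed_columns]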

# group rows by neighborhood, taking the mean of the frequency of occurrence of each category
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()

toronto_grouped.head()

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0
1,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Agincourt North, L'Amoreaux East, Milliken, St...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
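
To make the numbers in the grouped table concrete: after one-hot encoding, each venue is a row with a single 1, so the per-neighborhood mean of a column is the share of that neighborhood's venues in that category. A toy sketch follows; the venue counts are made up for illustration, and only the names are borrowed from the data.

```python
import pandas as pd

# Toy venue list: two venues in Woburn, four in Cedarbrae (counts are invented).
venues = pd.DataFrame({
    'Neighborhood': ['Woburn', 'Woburn',
                     'Cedarbrae', 'Cedarbrae', 'Cedarbrae', 'Cedarbrae'],
    'Venue Category': ['Coffee Shop', 'Korean Restaurant',
                       'Bakery', 'Bakery', 'Lounge', 'Bank'],
})

# One indicator column per category, one row per venue.
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of the indicators is the category's frequency within each neighborhood.
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```

Here Cedarbrae gets 0.5 for Bakery (two of its four venues), and each neighborhood's frequencies sum to 1.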


Now let's use the function from the Exploring Neighborhoods in Manhattan lab that sorts venues in descending order of frequency...

In [41]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

...and we will create a new dataframe to display the top 10 venues in each neighborhood

In [42]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # no 'st'/'nd'/'rd' entry left for this position, so fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,...,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue,21th Most Common Venue,22th Most Common Venue,23th Most Common Venue,24th Most Common Venue,25th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Café,Bar,Steakhouse,Thai Restaurant,Asian Restaurant,Cosmetics Shop,Hotel,Restaurant,...,Pizza Place,Sushi Restaurant,Concert Hall,Gastropub,Noodle House,Poke Place,Brazilian Restaurant,Salon / Barbershop,Colombian Restaurant,Ice Cream Shop
1,Agincourt,Lounge,Breakfast Spot,Clothing Store,Skating Rink,Women's Store,Dumpling Restaurant,Dive Bar,Dog Run,Doner Restaurant,...,Ethiopian Restaurant,Event Space,Discount Store,Dim Sum Restaurant,Farmers Market,Dessert Shop,Department Store,Deli / Bodega,Dance Studio,Curling Ice
2,"Agincourt North, L'Amoreaux East, Milliken, St...",Park,Playground,Women's Store,Drugstore,Diner,Discount Store,Dive Bar,Dog Run,Doner Restaurant,...,Ethiopian Restaurant,Event Space,Dim Sum Restaurant,Department Store,College Stadium,Deli / Bodega,Dance Studio,Curling Ice,Cupcake Shop,Cuban Restaurant
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",Grocery Store,Pharmacy,Japanese Restaurant,Fast Food Restaurant,Beer Store,Discount Store,Sandwich Place,Fried Chicken Joint,Coffee Shop,...,Drugstore,Donut Shop,Doner Restaurant,Dog Run,Dive Bar,Comic Shop,Concert Hall,Diner,Convenience Store,Dim Sum Restaurant
4,"Alderwood, Long Branch",Pizza Place,Pharmacy,Pub,Sandwich Place,Gym,Coffee Shop,Skating Rink,Dim Sum Restaurant,Diner,...,Drugstore,Dumpling Restaurant,Eastern European Restaurant,Dessert Shop,Curling Ice,Deli / Bodega,Dance Studio,Empanada Restaurant,Cupcake Shop,Cuban Restaurant
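
The column headers above include "21th", "22th" and "23th", because the `indicators` list only covers the first three positions and everything after them gets "th". A small helper could generate correct English ordinals for any count. The function name `ordinal` is hypothetical, and this is one possible sketch rather than the lab's approach.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 3rd, 4th, 11th, 21st."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th and 13th are irregular; the rest of the teens take 'th' too.
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


num_top_venues = 25
columns = ['Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns[21:24])
```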


Cluster Neighborhoods: run k-means to group the neighborhoods into 5 clusters based on the similarity of their venue features

In [43]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 3, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
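
Nine of the ten labels shown above fall in cluster 0, so before interpreting the map it may be worth checking how many neighborhoods land in each cluster. The sketch below counts only the ten labels printed above. With the fitted model one would pass `kmeans.labels_.tolist()` instead.

```python
from collections import Counter

# First ten labels from the output above.
labels = [0, 0, 3, 0, 0, 0, 0, 0, 0, 0]

sizes = Counter(labels)
for cluster, count in sorted(sizes.items()):
    print('cluster {}: {} neighborhoods'.format(cluster, count))
```

If one cluster holds most of the neighborhoods in the full result as well, the remaining clusters mainly describe a few outliers, which affects how the final map should be read.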

Here we create a new dataframe that includes the cluster result as well as the top 10 venues of each neighborhood

In [45]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = torontoMerged_df

# join the cluster labels and venue rankings onto torontoMerged_df so each neighborhood keeps its latitude/longitude
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood', how = 'right')

toronto_merged.head() # check the last columns!

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,...,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue,21th Most Common Venue,22th Most Common Venue,23th Most Common Venue,24th Most Common Venue,25th Most Common Venue
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,1,Fast Food Restaurant,Drugstore,Diner,Discount Store,...,Falafel Restaurant,Dumpling Restaurant,Dessert Shop,Department Store,Deli / Bodega,Dance Studio,Curling Ice,Cupcake Shop,Cuban Restaurant,Creperie
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,0,Moving Target,Bar,Women's Store,Donut Shop,...,Event Space,Diner,Dessert Shop,Farmers Market,Department Store,Deli / Bodega,Dance Studio,Curling Ice,Cupcake Shop,Cuban Restaurant
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,0,Intersection,Mexican Restaurant,Breakfast Spot,Pizza Place,...,Dumpling Restaurant,Eastern European Restaurant,Donut Shop,Department Store,Dessert Shop,Ethiopian Restaurant,Deli / Bodega,Dance Studio,Curling Ice,Cupcake Shop
3,M1G,Scarborough,Woburn,43.770992,-79.216917,0,Coffee Shop,Korean Restaurant,Women's Store,Discount Store,...,Event Space,Diner,Dessert Shop,Farmers Market,Department Store,Deli / Bodega,Dance Studio,Curling Ice,Cupcake Shop,Cuban Restaurant
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,0,Lounge,Bakery,Hakka Restaurant,Athletics & Sports,...,Electronics Store,Empanada Restaurant,Ethiopian Restaurant,Event Space,Falafel Restaurant,Discount Store,Women's Store,Diner,Farmers Market,Dessert Shop


Finally, we will visualize the resulting clusters!

In [46]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
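
One detail of the cell above is that it indexes the palette with `cluster-1`, so cluster 0 picks up the last color through Python's negative indexing. The sketch below uses an illustrative five-color palette (the hex values are assumptions, not taken from the notebook's output). It shows that this offset still gives each cluster its own color, only rotated by one position compared with indexing by `cluster` directly.

```python
# Illustrative palette with one entry per cluster (values are assumptions).
palette = ['#8000ff', '#00b5eb', '#80ffb4', '#ffb360', '#ff0000']
kclusters = len(palette)

# Indexing as in the cell above: cluster 0 wraps around to palette[-1].
offset_colors = {cluster: palette[cluster - 1] for cluster in range(kclusters)}
# Direct indexing: cluster 0 takes palette[0].
direct_colors = {cluster: palette[cluster] for cluster in range(kclusters)}

print(offset_colors[0])
print(direct_colors[0])
```

Either mapping is valid for the map. The direct form is simply easier to reason about when matching colors to cluster labels.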

## Thank you for viewing my geospatial k-means clustering assignment  
  
![alt text](https://www.jgt.ie/wp-content/uploads/2014/11/toronto-154805.jpg)