# Week 3 Assignment: Segmenting and Clustering Neighborhoods in Toronto

## Part 1: Webscraping
First, the imports:
- Pandas for data management
- Requests to scrape the webpage
- BeautifulSoup to navigate the HTML

In [1]:
import pandas as pd
# Set high but not unlimited max rows and columns, to avoid overstressing my machine
pd.options.display.max_rows = 250
pd.options.display.max_columns = 100
import requests
from bs4 import BeautifulSoup

The web page used claims to contain a table with every postal code in Toronto, making it perfect for our needs. 
*Unfortunately, I had no easy way to verify its accuracy, so the following lab assumes the Wikipedia article remains accurate.*

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

html_data = requests.get(url).text

toronto_soup = BeautifulSoup(html_data,"html5lib")

The page is fetched using ```requests.get``` and parsed into a BeautifulSoup object, which makes it possible to identify the tables in the HTML using ```soup.find_all('table')```

In [3]:
toronto_tables = toronto_soup.find_all('table')
len(toronto_tables)

3

Since there were only 3 tables, finding the correct one manually was easier than writing a code loop. This was done by skimming the results of:  

```print(toronto_tables[n].prettify())``` for ```0```, ```1```, and ```2```  

Table 0 contains the neighborhood data

In [4]:
toronto_table = toronto_tables[0]

### Creating and Cleaning the Dataframe
Now that we have the proper table, the following cells serve to enter the data into the Pandas dataframe in the desired form.  This notebook assumes that the first 3 non-whitespace characters of every cell make up the postal code, and that the neighborhoods are always separated from the borough by an open parenthesis '(' 

***Further details regarding the reformatting are explained in comments in the code below***

In [5]:
# Collect one dict per assigned postal code; the dataframe is built once at the end,
# since DataFrame.append was removed in pandas 2.0
rows = []

# Loop through all of the data cells in the table
for cell in toronto_table.find_all('td'):
    text = cell.text.strip()
    # Skip any cells that aren't assigned
    if 'Not assigned' not in text:
        # The postal code is always the first 3 characters of the cell, which makes it easy to split off using slicing
        postalcode = text[0:3]

        # The remainder of the text has to be split on the opening parenthesis, and then the neighborhoods have to be reformatted
        other = text[3:].split('(')
        borough = other[0]
        neighborhood = (((other[1].strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        rows.append({'PostalCode': postalcode,
                     'Borough': borough,
                     'Neighborhood': neighborhood})

# Build the dataframe with the named columns from the collected rows
toronto_df = pd.DataFrame(rows, columns=['PostalCode', 'Borough', 'Neighborhood'])
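
The parsing rules described above can be tried on a single cell before looping over the whole table. The sketch below wraps a slightly simplified version of the same slicing and splitting in a function; the name `parse_cell` and the sample string are made up for illustration and are not taken from the scraped page.

```python
def parse_cell(text):
    """Split one table cell into (postal code, borough, neighborhoods).

    Assumes the layout described above: a 3-character postal code, then the
    borough, then the neighborhoods inside parentheses separated by ' /'.
    """
    text = text.strip()
    postalcode = text[:3]
    # partition() returns empty strings when there is no '(' instead of raising
    borough, _, rest = text[3:].partition('(')
    neighborhood = rest.replace(')', '').replace(' /', ',').strip()
    return postalcode, borough.strip(), neighborhood


# Hypothetical cell text, written to match the assumed layout
sample = 'M5ADowntown Toronto(Regent Park / Harbourfront)'
print(parse_cell(sample))
```

Using `partition` instead of `split` means a cell without parentheses yields an empty neighborhood string rather than an `IndexError`, which may or may not be the behavior you want.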


In [6]:
toronto_df['Borough'].value_counts()

North York                                                      24
Scarborough                                                     17
Downtown Toronto                                                17
Etobicoke                                                       11
Central Toronto                                                  9
West Toronto                                                     6
York                                                             5
East York                                                        4
East Toronto                                                     4
EtobicokeNorthwest                                               1
Queen's Park                                                     1
MississaugaCanada Post Gateway Processing Centre                 1
East TorontoBusiness reply mail Processing Centre969 Eastern     1
East YorkEast Toronto                                            1
Downtown TorontoStn A PO Boxes25 The Esplanade                   1
Name: Borough, dtype: int64

In [7]:
# There are a handful of boroughs that didn't get processed properly, so let's fix them
toronto_df['Borough']=toronto_df['Borough'].replace({'MississaugaCanada Post Gateway Processing Centre':'Mississauga',
                                                 'EtobicokeNorthwest':'Etobicoke Northwest',
                                                 'East YorkEast Toronto':'East York/East Toronto',
                                                 'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                                 'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                                 })
toronto_df['Borough'].value_counts()

North York                24
Scarborough               17
Downtown Toronto          17
Etobicoke                 11
Central Toronto            9
West Toronto               6
York                       5
East Toronto               4
East York                  4
Queen's Park               1
East Toronto Business      1
Etobicoke Northwest        1
Downtown Toronto Stn A     1
East York/East Toronto     1
Mississauga                1
Name: Borough, dtype: int64

### The Dataframe should now be complete! Let's take a look

In [8]:
toronto_df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Queen's Park | Ontario Provincial Government |


In [9]:
toronto_df.shape

(103, 3)

## Part 2: Geographical Coordinates

***I wasn't able to get the geocoder library to work: it would loop forever on the first postal code. I don't have enough experience with it to troubleshoot, so for lack of time I decided to just use the CSV file provided.***
<details>
<summary>(Dropdown for my attempted code) ↓</summary>
<p>
    
```python
# To start install and import geocoder
!pip install geocoder
import geocoder
    
# Initialize the empty lists of coordinates, to add to the dataframe
lat_list = []
lng_list = []

# debug code
n = 0
# loop until you get the coordinates
for postal_code in toronto_df['PostalCode']:
    # debug code
    i = 0
    # initialize your variable to None
    lat_lng_coords = None
    while(lat_lng_coords is None):
        # debug print
        print('Attempt {} for postal code {}'.format(i, n))
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
        i = i+1
    lat_list.append(lat_lng_coords[0])
    lng_list.append(lat_lng_coords[1])
    n = n+1
```
</p>
</details>
<br>
<br>
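
The `while` loop in the attempt above never exits if the service keeps returning nothing. One way to guard against that is to cap the number of attempts. The sketch below uses a stand-in function in place of a real geocoder; `lookup_with_retries` and `always_fails` are hypothetical names, not part of the geocoder library.

```python
import time


def lookup_with_retries(lookup, query, max_attempts=3, delay_seconds=0.0):
    """Call lookup(query) until it returns a value or the attempts run out.

    `lookup` is any callable that returns None on failure; it stands in for
    a geocoding call. Returns None if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        result = lookup(query)
        if result is not None:
            return result
        print('Attempt {} for {} returned nothing'.format(attempt, query))
        time.sleep(delay_seconds)
    return None


# Stand-in geocoder that never succeeds, to show that the loop terminates
def always_fails(query):
    return None


coords = lookup_with_retries(always_fails, 'M3A, Toronto, Ontario')
print(coords)
```

A failed lookup then surfaces as `None` after a few attempts, so the caller can record a missing value and move on to the next postal code.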

### Grabbing the csv and creating a temporary dataframe

In [10]:
import io

In [11]:
url = 'http://cocl.us/Geospatial_data'

geo_csv = requests.get(url).content

geo_df = pd.read_csv(io.StringIO(geo_csv.decode('utf-8')))

### Merging the two dataframes using ```pd.merge```

In [12]:
# The Postal Code column in the geo_df is renamed to match that of the existing dataframe, for ease of merging
geo_df.rename(columns={'Postal Code':'PostalCode'}, inplace=True)

# Name the join key explicitly; a left merge keeps every row of toronto_df
toronto_df = pd.merge(toronto_df, geo_df, how='left', on='PostalCode')

In [13]:
toronto_df.head(10)

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Queen's Park | Ontario Provincial Government | 43.662301 | -79.389494 |
| 5 | M9A | Etobicoke | Islington Avenue | 43.667856 | -79.532242 |
| 6 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 7 | M3B | North York | Don Mills North | 43.745906 | -79.352188 |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens | 43.706397 | -79.309937 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
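
A left merge keeps every row of the left dataframe and fills the coordinates with NaN wherever a postal code has no match, so it can be worth checking that nothing was left unmatched. The sketch below shows one way to do that on toy data; `M9Z` is a made-up code, and the frame names are placeholders rather than the notebook's own.

```python
import pandas as pd

# Toy data: 'M9Z' is a made-up code with no match in the coordinates table
left = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M9Z']})
right = pd.DataFrame({'PostalCode': ['M3A', 'M4A'],
                      'Latitude': [43.753259, 43.725882],
                      'Longitude': [-79.329656, -79.315572]})

# indicator=True adds a '_merge' column saying where each row came from
merged = pd.merge(left, right, how='left', on='PostalCode', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched['PostalCode'].tolist())
```

An empty list here would mean every postal code found its coordinates.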


## Part 3: Exploration and Clustering

In [14]:
import numpy as np # library to handle data in a vectorized manner

import json # library to handle JSON files

!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# Import k-means from clustering stage
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale
!pip install folium
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


### Getting the coordinates of each neighborhood
The dataframe above has the latitude and longitude of the postal codes, rather than those of the specific neighborhoods.  The differences tend to be subtle, but they are a problem when trying to compare two neighborhoods within the same postal code.  Rather than clustering the postal codes, I decided to use ```Nominatim``` to get the latitude and longitude of each neighborhood. 

In [15]:
# Create a geolocator agent
geolocator = Nominatim(user_agent="tor_explorer")


Create a list of each neighborhood, separating any neighborhoods that exist within the same postal code.  
This solution is a little dense, but it was the most elegant I could find.

In [16]:
# Split the list of each neighborhood on the comma separator
temp = [i.split(', ') for i in toronto_df['Neighborhood'].tolist()]
# Flatten the new list-of-lists
neigh_list = [item for sublist in temp for item in sublist]
neigh_list[0:10]

['Parkwoods',
 'Victoria Village',
 'Regent Park',
 'Harbourfront',
 'Lawrence Manor',
 'Lawrence Heights',
 'Ontario Provincial Government',
 'Islington Avenue',
 'Malvern',
 'Rouge']
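
For comparison, the split-and-flatten step can also be written with `itertools.chain.from_iterable` from the standard library. This is only an alternative sketch on toy data, not the code the notebook ran.

```python
from itertools import chain

# Toy stand-in for toronto_df['Neighborhood'].tolist()
neighborhood_cells = ['Parkwoods', 'Regent Park, Harbourfront', 'Malvern, Rouge']

# Split each cell on ', ' and chain the resulting lists into one flat list
flat = list(chain.from_iterable(cell.split(', ') for cell in neighborhood_cells))
print(flat)
```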

Run a loop to get the latitude and longitude of each neighborhood using ```Nominatim```, and add them to a new dataframe ```neigh_df```

In [17]:
# Collect one row of coordinate data per neighborhood
neigh_rows = []

for neighborhood in neigh_list:
    address = '{}, Toronto, ON, Canada'.format(neighborhood)
    location = geolocator.geocode(address)
    # geocode returns None when nothing is found; record NaN in that case
    if location is None:
        latitude, longitude = np.nan, np.nan
    else:
        latitude, longitude = location.latitude, location.longitude
    neigh_rows.append({'Neighborhood': neighborhood,
                       'Latitude': latitude,
                       'Longitude': longitude})

# Build the dataframe once from the collected rows (DataFrame.append was removed in pandas 2.0)
neigh_df = pd.DataFrame(neigh_rows, columns=['Neighborhood', 'Latitude', 'Longitude'])

In [18]:
neigh_df.loc[neigh_df['Latitude'].isna()]

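| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 6 | Ontario Provincial Government | NaN | NaN |
| 37 | Caledonia-Fairbanks | NaN | NaN |
| 102 | Keelsdale and Silverthorn | NaN | NaN |
| 128 | North Midtown | NaN | NaN |
| 132 | Enclave of L4W | NaN | NaN |
| 169 | Humber Bay Shores | NaN | NaN |
| 175 | Beaumond Heights | NaN | NaN |
| 202 | Enclave of M4L | NaN | NaN |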


The geo lookup failed on the 8 neighborhoods above, I assume because their addresses don't fit the same pattern as the rest (i.e. ```'{Neighborhood}, Toronto, ON, Canada'```), or because they represent placeholder names in the original table.  Rather than try to find the correct address for each of them, I decided to simply drop them. While I was at it, I dropped two duplicate rows as well.

In [19]:
print(neigh_df.shape)
neigh_df.dropna(inplace=True)
print(neigh_df.shape)
neigh_df.drop_duplicates(inplace=True)
print(neigh_df.shape)

(216, 3)
(208, 3)
(206, 3)


### Plotting the map of the neighborhoods using the newly created dataframe. 

In [20]:
# Create map of Toronto using latitude and longitude values
# Toronto is located at 43.6532° N, 79.3832° W according to a quick search
toronto_map = folium.Map(location=[43.6532, -79.3832], zoom_start=11)

# Add markers to map
# Code taken with slight alterations from lab 3-3-2. No need to reinvent the wheel
for lat, lng, neighborhood in zip(neigh_df['Latitude'], neigh_df['Longitude'], neigh_df['Neighborhood']):
    label = neighborhood
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(toronto_map)  
    
toronto_map

### Exploring nearby venues in Foursquare

*Hidden cell containing Foursquare credentials*

In [21]:
# The code was removed by Watson Studio for sharing.

Here I borrow the ```getNearbyVenues``` function from the 3-3-2 lab again. As before, no need to reinvent the wheel.  
It returns the top venues within 500 meters of each input neighborhood, up to 100 venues per neighborhood, based on Foursquare's recommendation system.  
It uses the Foursquare API *Get* method with the *Explore* endpoint.

In [22]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
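
The request URL in the function above is assembled with a long `format` call. As a hedged alternative sketch, the same query string can be built from a dictionary with `urllib.parse.urlencode`, which also escapes the values. The credentials, version string, and limit below are placeholders chosen for illustration; the real ones are in the hidden cell.

```python
from urllib.parse import urlencode

# Placeholder values; the real credentials live in the hidden cell above
params = {
    'client_id': 'YOUR_CLIENT_ID',
    'client_secret': 'YOUR_CLIENT_SECRET',
    'v': '20180605',   # assumed version string, for illustration only
    'll': '{},{}'.format(43.7588, -79.320197),
    'radius': 500,
    'limit': 100,      # assumed value of LIMIT
}

# urlencode escapes each value, so the comma in 'll' becomes %2C
url = 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)
print(url)
```

Keeping the parameters in a dictionary makes it easier to see at a glance which values vary per neighborhood (`ll`) and which stay fixed.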

Then run that function on the entire neighborhood dataframe

In [23]:
toronto_venues = getNearbyVenues(names=neigh_df['Neighborhood'], latitudes=neigh_df['Latitude'], longitudes=neigh_df['Longitude'])

In [24]:
toronto_venues.head(10)

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Parkwoods | 43.7588 | -79.320197 | Allwyn's Bakery | 43.75984 | -79.324719 | Caribbean Restaurant |
| 1 | Parkwoods | 43.7588 | -79.320197 | LCBO | 43.757774 | -79.314257 | Liquor Store |
| 2 | Parkwoods | 43.7588 | -79.320197 | Shoppers Drug Mart | 43.760857 | -79.324961 | Pharmacy |
| 3 | Parkwoods | 43.7588 | -79.320197 | Petro-Canada | 43.75795 | -79.315187 | Gas Station |
| 4 | Parkwoods | 43.7588 | -79.320197 | Pizza Pizza | 43.760231 | -79.325666 | Pizza Place |
| 5 | Parkwoods | 43.7588 | -79.320197 | TD Canada Trust | 43.75744 | -79.314838 | Bank |
| 6 | Parkwoods | 43.7588 | -79.320197 | Parkwoods Coin Laundry | 43.760386 | -79.324894 | Laundry Service |
| 7 | Parkwoods | 43.7588 | -79.320197 | Bus Stop: 95 & 24 | 43.758083 | -79.314986 | Bus Line |
| 8 | Parkwoods | 43.7588 | -79.320197 | Family Food Fair | 43.760422 | -79.325012 | ATM |
| 9 | Parkwoods | 43.7588 | -79.320197 | Parkwoods Village Centre | 43.760735 | -79.324873 | Shopping Mall |


In [25]:
venue_counts = toronto_venues.groupby('Neighborhood').count()[['Venue']]
# axis=1 is needed so that rename() targets the column rather than the index labels
venue_counts.rename({'Venue': 'Venue Count'}, axis=1, inplace=True)
venue_counts.head(10)

| Neighborhood | Venue Count |
|---|---|
| Adelaide | 100 |
| Agincourt | 11 |
| Agincourt North | 25 |
| Albion Gardens | 7 |
| Alderwood | 8 |
| Bathurst Manor | 4 |
| Bathurst Quay | 25 |
| Bayview Village | 13 |
| Bedford Park | 1 |
| Berczy Park | 100 |


In [26]:
print("Neighborhoods with more than 50 venues within 500 meters", venue_counts[venue_counts['Venue Count'] > 50].shape[0])
print("Neighborhoods with less than 50 venues within 500 meters", venue_counts[venue_counts['Venue Count'] < 50].shape[0])
print("Neighborhoods with less than 5 venues within 500 meters", venue_counts[venue_counts['Venue Count'] < 5].shape[0])

Neighborhoods with more than 50 venues within 500 meters 38
Neighborhoods with less than 50 venues within 500 meters 166
Neighborhoods with less than 5 venues within 500 meters 41


Unlike with the Manhattan data, a large portion of our neighborhoods have very few venues within 500 meters. More neighborhoods have fewer than 5 venues within the set range (41) than have more than 50 (38).  This may make clustering them the same way we did the New York data inconsistent, due to the large amount of variance. 

To resolve this, I first decided to try extending the search radius to 1 kilometer (or rather, 999 meters, since the API call would fail on 1000), due to Toronto's larger and more spread-out layout compared to New York. 

In [27]:
toronto_venues_1k = getNearbyVenues(names=neigh_df['Neighborhood'], latitudes=neigh_df['Latitude'], longitudes=neigh_df['Longitude'], radius=999)

In [28]:
venue_counts_1k = toronto_venues_1k.groupby('Neighborhood').count()[['Venue']]
venue_counts_1k.rename({'Venue':'Venue Count'},axis=1,inplace=True)
venue_counts_1k.head(10)

| Neighborhood | Venue Count |
|---|---|
| Adelaide | 100 |
| Agincourt | 38 |
| Agincourt North | 30 |
| Albion Gardens | 21 |
| Alderwood | 22 |
| Bathurst Manor | 18 |
| Bathurst Quay | 100 |
| Bayview Village | 49 |
| Bedford Park | 68 |
| Berczy Park | 100 |


In [29]:
print("Neighborhoods with 100 or more venues within 999 meters", venue_counts_1k[venue_counts_1k['Venue Count'] == 100].shape[0])
print("Neighborhoods with more than 50 venues within 999 meters", venue_counts_1k[venue_counts_1k['Venue Count'] > 50].shape[0])
print("Neighborhoods with less than 50 venues within 999 meters", venue_counts_1k[venue_counts_1k['Venue Count'] < 50].shape[0])
print("Neighborhoods with less than 20 venues within 999 meters", venue_counts_1k[venue_counts_1k['Venue Count'] < 20].shape[0])
print("Neighborhoods with less than 5 venues within 999 meters", venue_counts_1k[venue_counts_1k['Venue Count'] < 5].shape[0])

Neighborhoods with 100 or more venues within 999 meters 47
Neighborhoods with more than 50 venues within 999 meters 84
Neighborhoods with less than 50 venues within 999 meters 122
Neighborhoods with less than 20 venues within 999 meters 53
Neighborhoods with less than 5 venues within 999 meters 3


With a much more reasonable distribution, I decided to stick with this radius.  Still, calculating the top 10 most common venue types for a neighborhood with fewer than 20 venues doesn't seem very meaningful.  I'm proceeding with the clustering using this data, but chose not to evaluate the top 10 for each neighborhood individually. 

In [30]:
# one hot encoding
toronto_onehot = (pd.get_dummies(toronto_venues_1k[['Venue Category']], prefix="", prefix_sep=""))
# The encoding produced a 'Neighborhood' column somewhere in the middle of the pile (presumably because 'Neighborhood' is itself a venue category),
# so dropping it and inserting the neighborhood names at the beginning was the easiest way to clean up the table.
toronto_onehot.drop('Neighborhood', axis=1, inplace=True)
toronto_onehot.insert(0,'Neighborhood', toronto_venues_1k['Neighborhood'])
toronto_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Amphitheater,Animal Shelter,Antique Shop,Aquarium,Arcade,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Dealership,Auto Garage,Automotive Shop,BBQ Joint,Baby Store,Badminton Court,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Baseball Stadium,Basketball Court,Basketball Stadium,Bath House,Beach,Beach Bar,Beer Bar,Beer Store,Belgian Restaurant,Big Box Store,Bike Shop,Bistro,Boat or Ferry,Bookstore,Botanical Garden,Boutique,Bowling Alley,Boxing Gym,...,Sporting Goods Shop,Sports Bar,Sports Club,Sri Lankan Restaurant,Stationery Store,Steakhouse,Storage Facility,Street Art,Supermarket,Supplement Shop,Sushi Restaurant,Swiss Restaurant,Syrian Restaurant,Taco Place,Tailor Shop,Taiwanese Restaurant,Tanning Salon,Tapas Restaurant,Tattoo Parlor,Tea Room,Tennis Court,Tennis Stadium,Tex-Mex Restaurant,Thai Restaurant,Theater,Theme Park,Theme Restaurant,Thrift / Vintage Store,Tibetan Restaurant,Tourist Information Center,Toy / Game Store,Track,Trail,Train Station,Tunnel,Turkish Restaurant,University,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Zoo Exhibit
0,Parkwoods,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Parkwoods,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Parkwoods,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


With the wide variation in the number of venues nearby, I felt that the total number of venues found would be a relevant feature to evaluate by, so I added that total in.

In [47]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped = toronto_grouped.merge(venue_counts_1k, how='left', on='Neighborhood')
toronto_grouped.head()

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Amphitheater,Animal Shelter,Antique Shop,Aquarium,Arcade,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Dealership,Auto Garage,Automotive Shop,BBQ Joint,Baby Store,Badminton Court,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Baseball Stadium,Basketball Court,Basketball Stadium,Bath House,Beach,Beach Bar,Beer Bar,Beer Store,Belgian Restaurant,Big Box Store,Bike Shop,Bistro,Boat or Ferry,Bookstore,Botanical Garden,Boutique,Bowling Alley,Boxing Gym,...,Sports Bar,Sports Club,Sri Lankan Restaurant,Stationery Store,Steakhouse,Storage Facility,Street Art,Supermarket,Supplement Shop,Sushi Restaurant,Swiss Restaurant,Syrian Restaurant,Taco Place,Tailor Shop,Taiwanese Restaurant,Tanning Salon,Tapas Restaurant,Tattoo Parlor,Tea Room,Tennis Court,Tennis Stadium,Tex-Mex Restaurant,Thai Restaurant,Theater,Theme Park,Theme Restaurant,Thrift / Vintage Store,Tibetan Restaurant,Tourist Information Center,Toy / Game Store,Track,Trail,Train Station,Tunnel,Turkish Restaurant,University,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Zoo Exhibit,Venue Count
0,Adelaide,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.02,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100
1,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.052632,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.026316,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.026316,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.026316,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,38
2,Agincourt North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,30
3,Albion Gardens,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.047619,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,21
4,Alderwood,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,22
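
The fractional values in the table above come from averaging 0/1 columns: the mean of a one-hot column within a neighborhood is the share of that neighborhood's venues in that category. A minimal sketch on toy data (the neighborhood names `A` and `B` and the categories are made up):

```python
import pandas as pd

# Toy venues: neighborhood 'A' has two cafes and a bar, 'B' has one bar
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Cafe', 'Cafe', 'Bar', 'Bar'],
})

# One 0/1 column per category, then the neighborhood name back in front
onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='', dtype=int)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of a 0/1 column is the fraction of venues in that category
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```

Because these shares always lie between 0 and 1 while the venue count can reach 100, the two kinds of feature sit on very different scales.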


### Clustering

In [32]:
# I chose 7 clusters for my K value
kclusters = 7

# I drop the Neighborhood label and then standardize the features, to account for the different scale of the total Venue Count
toronto_grouped_clustering = scale(toronto_grouped.drop('Neighborhood', axis=1))

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=4).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([4, 2, 2, 0, 0, 0, 2, 0, 2, 4], dtype=int32)
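
To illustrate what the scaling step does, the sketch below standardizes two toy columns by hand: each value has the column mean subtracted and is divided by the column's population standard deviation, which is the same centering and unit-variance idea that `scale` applies per feature. The numbers are invented for illustration.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return (x - mean) / std for each x, using the population std."""
    mean = fmean(values)
    std = pstdev(values)
    return [(x - mean) / std for x in values]


# Toy columns on very different scales
venue_count = [100, 38, 30, 21, 22]
category_share = [0.02, 0.0, 0.05, 0.0, 0.03]

scaled_count = standardize(venue_count)
scaled_share = standardize(category_share)
print([round(v, 2) for v in scaled_count])
print([round(v, 2) for v in scaled_share])
```

After standardizing, both columns have mean 0 and standard deviation 1, so the venue count no longer dominates the distance calculations in k-means simply because its raw numbers are larger.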

### Mapping the Clusters

In [33]:
# add clustering labels to a copy of neigh_df for mapping, so neigh_df itself is left unchanged
map_df = neigh_df.copy()
map_df['Cluster Label'] = kmeans.labels_

map_df.head()

| | Neighborhood | Latitude | Longitude | Cluster Label |
|---|---|---|---|---|
| 0 | Parkwoods | 43.7588 | -79.320197 | 4 |
| 1 | Victoria Village | 43.732658 | -79.311189 | 2 |
| 2 | Regent Park | 43.660706 | -79.360457 | 2 |
| 3 | Harbourfront | 43.64008 | -79.38015 | 0 |
| 4 | Lawrence Manor | 43.722079 | -79.437507 | 0 |
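
Assigning `kmeans.labels_` by position assumes that `neigh_df` and `toronto_grouped` list the neighborhoods in the same order. `groupby` sorts its keys by default, so `toronto_grouped` is alphabetical while `neigh_df` follows the order of the scraped table: the first label in the array (4) was computed for the first grouped row, Adelaide, yet it appears next to Parkwoods here. A sketch of matching labels by name instead, using toy data and hypothetical label values:

```python
# Toy data: neighborhoods in their original (table) order
original_order = ['Parkwoods', 'Victoria Village', 'Adelaide']

# groupby() returns its keys sorted, so the clustering input is alphabetical
grouped_order = sorted(original_order)
labels = [4, 2, 0]  # hypothetical labels, one per row of grouped_order

# Pair each label with the name of the row it was computed from
label_by_name = dict(zip(grouped_order, labels))

# Look labels up by name rather than copying them across by position
aligned = [label_by_name[name] for name in original_order]
print(aligned)
```

With the dataframes in this notebook, the same idea would amount to merging `neigh_df` on `Neighborhood` with a two-column frame built from `toronto_grouped['Neighborhood']` and `kmeans.labels_`; the maps and cluster centers would then need to be regenerated.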


With the neighborhoods all clustered, I map each neighborhood with folium using color to indicate the clusters, borrowing the mapping code from the lab once more.

In [34]:
# create map
clusters_map = folium.Map(location=[43.6532, -79.3832], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(map_df['Latitude'], map_df['Longitude'], map_df['Neighborhood'], map_df['Cluster Label']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(clusters_map)

clusters_map

The Clusters don't appear to have much in common in terms of geographic location at first glance.  Out of curiosity, I plotted the geographic center of each cluster on the map as well.

In [35]:
cluster_centers = map_df.groupby('Cluster Label')[['Latitude','Longitude']].mean().reset_index()
cluster_centers

| | Cluster Label | Latitude | Longitude |
|---|---|---|---|
| 0 | 0 | 43.708989 | -79.406957 |
| 1 | 1 | 43.680461 | -79.505196 |
| 2 | 2 | 43.694276 | -79.401217 |
| 3 | 3 | 43.726328 | -79.415904 |
| 4 | 4 | 43.698451 | -79.415555 |
| 5 | 5 | 43.749045 | -79.289843 |
| 6 | 6 | 43.6901 | -79.363592 |


In [36]:
# add the geographical average of each cluster's latitude and longitude to the map
markers_colors = []
for lat, lon, cluster in zip(cluster_centers['Latitude'], cluster_centers['Longitude'], cluster_centers['Cluster Label']):
    label = folium.Popup(' Cluster Center ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=15,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(clusters_map)

clusters_map

### Determining the top venues of each cluster

In [76]:
# Creating a new dataframe that includes all the relevant information, as I hadn't done so earlier.
cluster_venues = toronto_grouped.copy()
cluster_venues.insert(1, 'Cluster Label', kmeans.labels_)
cluster_venues = cluster_venues.groupby('Cluster Label').mean().reset_index()
cluster_venues

Unnamed: 0,Cluster Label,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Amphitheater,Animal Shelter,Antique Shop,Aquarium,Arcade,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Dealership,Auto Garage,Automotive Shop,BBQ Joint,Baby Store,Badminton Court,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Baseball Stadium,Basketball Court,Basketball Stadium,Bath House,Beach,Beach Bar,Beer Bar,Beer Store,Belgian Restaurant,Big Box Store,Bike Shop,Bistro,Boat or Ferry,Bookstore,Botanical Garden,Boutique,Bowling Alley,Boxing Gym,...,Sports Bar,Sports Club,Sri Lankan Restaurant,Stationery Store,Steakhouse,Storage Facility,Street Art,Supermarket,Supplement Shop,Sushi Restaurant,Swiss Restaurant,Syrian Restaurant,Taco Place,Tailor Shop,Taiwanese Restaurant,Tanning Salon,Tapas Restaurant,Tattoo Parlor,Tea Room,Tennis Court,Tennis Stadium,Tex-Mex Restaurant,Thai Restaurant,Theater,Theme Park,Theme Restaurant,Thrift / Vintage Store,Tibetan Restaurant,Tourist Information Center,Toy / Game Store,Track,Trail,Train Station,Tunnel,Turkish Restaurant,University,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Zoo Exhibit,Venue Count
0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002213,0.0,0.0,0.000287,0.0,0.000354,0.0,0.0,0.000539,0.00822,0.003373,0.0,0.001819,0.000544,0.000849,0.000294,0.0,0.004075,0.018295,0.017033,0.000959,0.006275,0.0,0.0,0.0,0.000365,0.001339,0.0,0.0,0.012475,0.0,0.002414,0.000294,0.001095,0.0,0.000455,0.0,0.0,0.0,0.000463,...,0.006836,0.0,0.001416,0.0,0.001724,0.003838,0.0,0.012039,0.001893,0.004798,0.000524,0.0,0.0,0.0,0.000524,0.0,0.0,0.0,0.000669,0.004394,0.0,0.0,0.004347,0.003966,0.000524,0.0,0.004063,0.0,0.000227,0.001332,0.0,0.005726,0.012337,0.0,0.001234,0.0,0.001559,0.000694,0.000446,0.012431,0.000602,0.0,0.0,0.0,0.00217,0.001442,0.000335,0.0,0.008283,22.180723
1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.008329,0.0,0.0,0.0,0.0,0.0,0.002821,0.000769,0.000769,0.002567,0.004457,0.0,0.0,0.0,0.008275,0.0,0.0,0.005657,0.018238,0.011459,0.008464,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.007885,0.001282,0.0,0.0,0.000769,0.002118,0.0,0.010192,0.000769,0.010192,0.0,0.0,...,0.001282,0.0,0.0,0.0,0.004426,0.0,0.0,0.004011,0.0,0.026854,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002821,0.001241,0.0,0.0,0.019787,0.004359,0.0,0.0025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015104,0.0,0.006413,0.003548,0.0,0.0,0.005293,0.0,0.001538,0.000769,0.0,0.012621,0.0,84.692308
2,2,0.000372,0.000916,0.000476,0.00222,0.000819,0.0007,0.00362,0.001401,0.002801,0.006847,0.000366,0.000238,0.00075,0.000119,0.0,0.001255,0.000833,0.003442,0.005963,0.003432,0.000357,0.0,0.000283,0.003966,0.0,0.000476,0.001607,0.024529,0.023215,0.018772,0.003228,0.000357,0.0,0.0,0.0,0.004284,0.000357,0.0025,0.00675,0.0,0.000326,0.000119,0.000119,0.002464,0.004258,0.000476,0.001657,0.00061,0.0,...,0.000965,0.000616,0.0,0.000119,0.001708,0.0,0.000119,0.003663,0.000966,0.025214,0.0,0.000294,0.000724,0.0,0.000238,0.000467,0.002476,0.001059,0.002974,0.000652,0.000441,0.000119,0.013494,0.004079,0.000372,0.000119,0.001483,0.000833,0.0,0.002182,0.000904,0.003914,0.000297,0.000119,0.001717,0.000119,0.004544,0.001119,0.000763,0.006945,0.001279,0.000238,0.001071,0.000175,0.000999,0.000783,0.0,0.004205,0.0,61.571429
3,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003333,0.0,0.0,0.0,0.0,0.003333,0.02433,0.0,0.017663,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010498,0.0,0.040996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017165,0.0,0.006667,0.0,0.0,0.0,0.0,0.007165,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.007165,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.003333,0.0,0.010498,0.0,0.0,0.0,0.003831,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003333,0.044828,0.0,0.0,0.017165,0.0,0.0,0.006667,0.0,0.0,0.0,0.0,0.020996,0.0,95.666667
4,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.008125,0.0,0.0,0.0,0.010625,0.0,0.01875,0.000625,0.000625,0.005,0.0,0.0,0.0,0.0,0.003125,0.0,0.0,0.0,0.0125,0.0,0.001875,0.0,0.010625,0.0,0.008125,0.0,0.0,0.0,0.02625,0.0,0.0,0.0,0.0,0.00875,0.0,0.00625,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.011875,0.0,0.0,0.00625,0.0,0.009375,0.0,0.0,0.0,0.00875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013125,0.023125,0.0,0.0,0.0,0.0,0.0,0.0,0.003125,0.0,0.009375,0.0,0.0,0.0,0.016875,0.0,0.0,0.0,0.0,0.0,0.00125,0.0,0.0,0.0,0.0,0.005,0.0,100.0
5,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.015,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.01,0.01,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0
6,6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.133333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,15.0


In [39]:
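def return_most_common_venues(row, num_top_venues):
    # Skip the first entry (the cluster label) so only venue-category frequencies remain
    row_categories = row.iloc[1:]
    # Sort the frequencies from highest to lowest
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    # Return the names of the num_top_venues most frequent categories
    return row_categories_sorted.index.values[0:num_top_venues]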
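
To make the behaviour of `return_most_common_venues` concrete, here is a small sketch on a made-up row. The category names and frequencies in `toy_row` are invented for illustration and are not taken from the Toronto data; the function is repeated only so the snippet runs on its own.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # Drop the first entry (the cluster label), keep the category frequencies
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Hypothetical row: a cluster label followed by venue-category frequencies
toy_row = pd.Series({
    'Cluster Label': 0,
    'Bakery': 0.10,
    'Coffee Shop': 0.50,
    'Park': 0.30,
    'Zoo': 0.00,
})

top3 = list(return_most_common_venues(toy_row, 3))
print(top3)  # ['Coffee Shop', 'Park', 'Bakery']
```

One caveat worth keeping in mind: when several categories share the same frequency (for example, many columns at 0.0), the relative order of those ties after `sort_values` is not something to rely on.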

In [73]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']


# create columns according to number of top venues
columns = ['Cluster Label']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # Positions past the 3rd have no entry in indicators, so fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
top_cluster_venues = pd.DataFrame(columns=columns)
top_cluster_venues['Cluster Label'] = cluster_venues['Cluster Label']

temp = cluster_venues.drop('Venue Count', axis=1)

for ind in np.arange(temp.shape[0]):
    top_cluster_venues.iloc[ind, 1:] = return_most_common_venues(temp.iloc[ind, :], num_top_venues)

top_cluster_venues['Venue Count'] = cluster_venues['Venue Count']
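
The column names in the cell above rely on an exception to switch from 'st', 'nd', 'rd' to 'th'. As an alternative sketch (not what the notebook itself uses), the same labels could be built with an explicit helper. The name `ordinal_label` is hypothetical, introduced here only for illustration.

```python
def ordinal_label(n):
    # Hypothetical helper: returns '1st', '2nd', '3rd', '4th', ...
    # Numbers ending in 11, 12, 13 take 'th' rather than 'st', 'nd', 'rd'
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

num_top_venues = 10

# Build the same header layout as above: label column first, then ranked venues
columns = ['Cluster Label'] + [
    '{} Most Common Venue'.format(ordinal_label(i))
    for i in range(1, num_top_venues + 1)
]
print(columns)
```

For `num_top_venues = 10` this gives the same headers as the try/except version; it only behaves differently for larger values such as 21, where it would yield '21st' instead of '21th'.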

In [74]:
top_cluster_venues

Unnamed: 0,Cluster Label,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Venue Count
0,0,Coffee Shop,Park,Pizza Place,Fast Food Restaurant,Grocery Store,Pharmacy,Sandwich Place,Gas Station,Convenience Store,Chinese Restaurant,22.180723
1,1,Coffee Shop,Italian Restaurant,Café,Park,Sushi Restaurant,Grocery Store,Restaurant,Thai Restaurant,French Restaurant,Japanese Restaurant,84.692308
2,2,Coffee Shop,Restaurant,Café,Park,Italian Restaurant,Pizza Place,Sushi Restaurant,Bakery,Bank,Sandwich Place,61.571429
3,3,Café,Coffee Shop,Vegetarian / Vegan Restaurant,Bar,Mexican Restaurant,Park,Art Gallery,Sandwich Place,Yoga Studio,Arts & Crafts Store,95.666667
4,4,Coffee Shop,Café,Hotel,Restaurant,Japanese Restaurant,Park,Beer Bar,Gym,Seafood Restaurant,Theater,100.0
5,5,Coffee Shop,Japanese Restaurant,Restaurant,Gay Bar,Café,Gastropub,Thai Restaurant,Diner,Park,Juice Bar,100.0
6,6,Athletics & Sports,Sandwich Place,Soccer Field,Basketball Court,Escape Room,French Restaurant,Gym / Fitness Center,Recreation Center,American Restaurant,Gas Station,15.0


In [38]:
map_df['Cluster Label'].value_counts()

2    84
0    83
4    16
1    13
6     5
3     3
5     2
Name: Cluster Label, dtype: int64
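
The counts above can be turned into shares to see how lopsided the clustering is. The sketch below hard-codes the seven counts from the output rather than reading `map_df`, so it runs on its own; in the notebook itself, `map_df['Cluster Label'].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Cluster sizes copied from the value_counts() output above
cluster_sizes = pd.Series({2: 84, 0: 83, 4: 16, 1: 13, 6: 5, 3: 3, 5: 2},
                          name='Cluster Label')

total = int(cluster_sizes.sum())
shares = (cluster_sizes / total).round(3)

# Neighborhoods in the two largest clusters versus all the others combined
largest_two = int(cluster_sizes.nlargest(2).sum())
remaining = total - largest_two

print(shares)
print('Total neighborhoods:', total)
print('In the two largest clusters:', largest_two)
print('In the other five clusters:', remaining)
```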

### Conclusion
There doesn't appear to be much variation in the venue categories across clusters: 5 of the 7 share the same most common venue, Coffee Shop (6 of 7, if you consider that Cafés and Coffee Shops fill a similar role). The vast majority of neighborhoods fell into clusters 2 and 0 (84 and 83 neighborhoods respectively), while the remaining 5 clusters combined (39 neighborhoods) fail to match even half of either one, which suggests a flaw in the experimental design. 

In the end, the most meaningful insight to take away from this is that comparing nearby venues is not an effective method for categorizing or distinguishing the neighborhoods in Toronto. 