# Week 3 Assignment: Segmenting and Clustering Neighborhoods in Toronto

## Part 1: Webscraping
First, the imports:
- Pandas for dataframe management
- Requests to scrape the webpage
- BeautifulSoup to navigate the HTML

In [2]:
import pandas as pd
# Set high but not unlimited max rows and columns, to avoid overstressing my machine
pd.options.display.max_rows = 250
pd.options.display.max_columns = 100
import requests
from bs4 import BeautifulSoup

The web page used claims to contain a table with every postal code in Toronto, making it perfect for our needs.  
*Unfortunately, I had no easy way to verify its accuracy, so the following lab assumes the Wikipedia article is accurate.*

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

html_data = requests.get(url).text

toronto_soup = BeautifulSoup(html_data,"html5lib")

The URL is fetched using ```requests.get``` and parsed into a BeautifulSoup object, which makes it possible to identify the tables in the HTML using ```soup.find_all('table')```.

In [4]:
toronto_tables = toronto_soup.find_all('table')
len(toronto_tables)

3

Since there were only 3 tables, finding the correct one manually was easier than writing a loop. This was done by skimming the results of:  

```print(toronto_tables[n].prettify())``` for ```0```, ```1```, and ```2```  

Table 0 contains the neighborhood data

In [5]:
toronto_table = toronto_tables[0]

### Creating and Cleaning the Dataframe
Now that we have the proper table, the following cells serve to enter the data into the Pandas dataframe in the desired form.  This notebook assumes that the first 3 non-whitespace characters of every cell make up the postal code, and that the neighborhoods are always separated from the borough by an open parenthesis '(' 

***Further details regarding the reformatting are explained in comments in the code below***

In [6]:
# Collect one record per assigned postal code, then build the dataframe in one go
# (DataFrame.append was removed in pandas 2.0)
rows = []

# Loop through all of the data cells in the table
for cell in toronto_table.find_all('td'):
    text = cell.text.strip()
    # Skip any cells that aren't assigned
    if 'Not assigned' not in text:
        # The postal code is always the first 3 characters of the cell, so it is easy to split off using slicing
        postalcode = text[0:3]

        # The remainder of the text has to be split along the opening parenthesis, and then the neighborhoods have to be reformatted
        other = text[3:].split('(')
        borough = other[0]
        neighborhood = other[1].strip(')').replace(' /', ',').replace(')', ' ').strip(' ')
        rows.append({'PostalCode': postalcode,
                     'Borough': borough,
                     'Neighborhood': neighborhood})

# Create the dataframe with the named columns
toronto_df = pd.DataFrame(rows, columns=['PostalCode', 'Borough', 'Neighborhood'])
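
The parsing rule used in the loop above can be checked in isolation on a single cell's text. The sketch below is only an illustration: `parse_cell` is a hypothetical helper, the sample string is written to mimic how a cell's text looks after `strip()`, and returning `None` for cells without a parenthesis is my assumption rather than something the notebook does.

```python
def parse_cell(text):
    """Split one table cell's text into (postal code, borough, neighborhoods).

    Hypothetical helper mirroring the slicing rule used in the loop above.
    """
    text = text.strip()
    if 'Not assigned' in text:
        return None
    # The first 3 characters are the postal code
    postalcode = text[0:3]
    # Everything up to the opening parenthesis is the borough
    borough, sep, rest = text[3:].partition('(')
    if not sep:
        # Assumption: cells without a parenthesis are skipped rather than raising an error
        return None
    # Drop the closing parenthesis and turn ' /' separators into commas
    neighborhood = rest.strip(')').replace(' /', ',').strip()
    return postalcode, borough.strip(), neighborhood


# Sample resembling a stripped cell from the table
print(parse_cell('M5ADowntown Toronto(Regent Park / Harbourfront)'))
```

Keeping the rule in one small function makes it easier to try it against the odd cells (processing centres, PO boxes) before running the whole table.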


In [7]:
toronto_df['Borough'].value_counts()

North York                                                      24
Downtown Toronto                                                17
Scarborough                                                     17
Etobicoke                                                       11
Central Toronto                                                  9
West Toronto                                                     6
York                                                             5
East Toronto                                                     4
East York                                                        4
MississaugaCanada Post Gateway Processing Centre                 1
East YorkEast Toronto                                            1
EtobicokeNorthwest                                               1
East TorontoBusiness reply mail Processing Centre969 Eastern     1
Downtown TorontoStn A PO Boxes25 The Esplanade                   1
Queen's Park                                                  

In [8]:
# There are a handful of boroughs that didn't get processed properly, so let's fix them
toronto_df['Borough']=toronto_df['Borough'].replace({'MississaugaCanada Post Gateway Processing Centre':'Mississauga',
                                                 'EtobicokeNorthwest':'Etobicoke Northwest',
                                                 'East YorkEast Toronto':'East York/East Toronto',
                                                 'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                                 'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                                 })
toronto_df['Borough'].value_counts()

North York                24
Downtown Toronto          17
Scarborough               17
Etobicoke                 11
Central Toronto            9
West Toronto               6
York                       5
East Toronto               4
East York                  4
Downtown Toronto Stn A     1
East Toronto Business      1
Etobicoke Northwest        1
East York/East Toronto     1
Mississauga                1
Queen's Park               1
Name: Borough, dtype: int64

### The Dataframe should now be complete! Let's take a look

In [9]:
toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government


In [10]:
toronto_df.shape

(103, 3)

## Part 2: Geographical Coordinates

***I wasn't able to get the geocoder library to work. It would loop forever on the first postal code. I don't have enough experience with it to troubleshoot, so for lack of time I decided to just use the CSV file provided.***
<details>
<summary>(Dropdown for my attempted code) ↓</summary>
<p>
    
```python
# To start install and import geocoder
!pip install geocoder
import geocoder
    
# Initialize the empty lists of coordinates, to add to the dataframe
lat_list = []
lng_list = []

# debug code
n = 0
# loop until you get the coordinates
for postal_code in toronto_df['PostalCode']:
    # debug code
    i = 0
    # initialize your variable to None
    lat_lng_coords = None
    while(lat_lng_coords is None):
        # debug print
        print('Attempt {} for postal code {}'.format(i, n))
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
        i = i+1
    lat_list.append(lat_lng_coords[0])
    lng_list.append(lat_lng_coords[1])
    n = n+1
```
</p>
</details>
<br>
<br>
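
A `while` loop that only exits on success never ends if the service keeps returning nothing, which matches the behavior described above. One possible guard, sketched here under assumptions, is to cap the number of attempts and fall back to `None`. `lookup_with_retries` and `fake_lookup` are hypothetical names; in the notebook the callable would wrap whichever geocoding call is being used.

```python
import time


def lookup_with_retries(lookup, query, max_attempts=3, delay_seconds=0.0):
    """Call lookup(query) until it returns a value or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        result = lookup(query)
        if result is not None:
            return result
        # Pause before the next attempt, but not after the last one
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    # Give up instead of looping forever
    return None


# Stand-in for a geocoding call that never succeeds (hypothetical)
def fake_lookup(query):
    return None


print(lookup_with_retries(fake_lookup, 'M3A, Toronto, Ontario'))
```

A postal code that comes back as `None` can then be recorded as missing and handled later, instead of blocking the rest of the loop.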

### Grabbing the CSV and creating a temporary dataframe

In [11]:
import io

In [12]:
url = 'http://cocl.us/Geospatial_data'

geo_csv = requests.get(url).content

geo_df = pd.read_csv(io.StringIO(geo_csv.decode('utf-8')))

### Merging the two dataframes using ```pd.merge```

In [13]:
# The Postal Code column in the geo_df is renamed to match that of the existing dataframe, for ease of merging
geo_df.rename(columns={'Postal Code':'PostalCode'}, inplace=True)

toronto_df = pd.merge(toronto_df, geo_df, how='left')
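
Because `pd.merge` is called without `on`, it joins on whichever column names the two dataframes share, which here is only `PostalCode` after the rename. A slightly more defensive variant, sketched below on made-up rows, names the key explicitly and uses the `indicator` option to flag postal codes that found no coordinates. The tiny dataframes and their values are placeholders, not data from the notebook.

```python
import pandas as pd

# Made-up stand-ins for toronto_df and geo_df
left = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M9Z'],
                     'Borough': ['North York', 'North York', 'Nowhere']})
right = pd.DataFrame({'PostalCode': ['M3A', 'M4A'],
                      'Latitude': [43.75, 43.72],
                      'Longitude': [-79.33, -79.31]})

# Name the join key explicitly and record where each row came from
merged = pd.merge(left, right, how='left', on='PostalCode', indicator=True)

# Rows marked 'left_only' had no match in the coordinates table
unmatched = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that every postal code in the scraped table received coordinates.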

In [14]:
toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494


## Part 3: Exploration and Clustering

In [15]:
import numpy as np # library to handle data in a vectorized manner

import json # library to handle JSON files

!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# Import k-means from clustering stage
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale
!pip install folium
import folium # map rendering library

print('Libraries imported.')

Collecting folium
  Downloading folium-0.12.1-py2.py3-none-any.whl (94 kB)
Collecting branca>=0.3.0
  Downloading branca-0.4.2-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.2 folium-0.12.1
Libraries imported.


### Getting the coordinates of each neighborhood
The dataframe above has the latitude and longitude of the postal codes, rather than those of the specific neighborhoods. The differences tend to be subtle, but they become a problem when trying to compare two neighborhoods within the same postal code. Rather than clustering the postal codes, I decided to use ```Nominatim``` to get the latitude and longitude of each neighborhood.

In [92]:
# Create a geolocator agent
geolocator = Nominatim(user_agent="tor_explorer")


Create a list of each neighborhood, separating any neighborhoods that share a postal code.  
This solution is a little dense, but it was the most elegant I could find.
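
The split-and-flatten step can also be written with `itertools.chain.from_iterable` from the standard library, which some readers find easier to scan than a nested comprehension. This is just an equivalent sketch; `sample` is a short placeholder for the `Neighborhood` column, using the first few entries of the table.

```python
from itertools import chain

# Short stand-in for toronto_df['Neighborhood'].tolist()
sample = ['Parkwoods', 'Regent Park, Harbourfront', 'Lawrence Manor, Lawrence Heights']

# Split each entry on the comma separator, then flatten the pieces into one list
flat = list(chain.from_iterable(entry.split(', ') for entry in sample))
print(flat)
```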

In [93]:
# Split each neighborhood entry on the comma separator
temp = [i.split(', ') for i in toronto_df['Neighborhood'].tolist()]
# Flatten the new list-of-lists
neigh_list = [item for sublist in temp for item in sublist]
neigh_list

['Parkwoods',
 'Victoria Village',
 'Regent Park',
 'Harbourfront',
 'Lawrence Manor',
 'Lawrence Heights',
 'Ontario Provincial Government',
 'Islington Avenue',
 'Malvern',
 'Rouge',
 'Don Mills North',
 'Parkview Hill',
 'Woodbine Gardens',
 'Garden District',
 'Ryerson',
 'Glencairn',
 'West Deane Park',
 'Princess Gardens',
 'Martin Grove',
 'Islington',
 'Cloverdale',
 'Rouge Hill',
 'Port Union',
 'Highland Creek',
 'Don Mills South',
 'Woodbine Heights',
 'St. James Town',
 'Humewood-Cedarvale',
 'Eringate',
 'Bloordale Gardens',
 'Old Burnhamthorpe',
 'Markland Wood',
 'Guildwood',
 'Morningside',
 'West Hill',
 'The Beaches',
 'Berczy Park',
 'Caledonia-Fairbanks',
 'Woburn',
 'Leaside',
 'Central Bay Street',
 'Christie',
 'Cedarbrae',
 'Hillcrest Village',
 'Bathurst Manor',
 'Wilson Heights',
 'Downsview North',
 'Thorncliffe Park',
 'Richmond',
 'Adelaide',
 'King',
 'Dufferin',
 'Dovercourt Village',
 'Scarborough Village',
 'Fairview',
 'Henry Farm',
 'Oriole',
 'Northw

Run a loop to get the latitude and longitude of each neighborhood using ```Nominatim```, and add it to a new dataframe ```neigh_df```.

In [94]:
# Collect the neighborhood coordinate data, one record per neighborhood
neigh_rows = []

for neighborhood in neigh_list:
    address = '{}, Toronto, ON, Canada'.format(neighborhood)
    location = geolocator.geocode(address)
    # geocode returns None when nothing is found, so record NaN for those
    if location is not None:
        latitude = location.latitude
        longitude = location.longitude
    else:
        latitude = np.nan
        longitude = np.nan
    neigh_rows.append({'Neighborhood': neighborhood,
                       'Latitude': latitude,
                       'Longitude': longitude})
    print('The geographical coordinates of {} are {}, {}.'.format(neighborhood, latitude, longitude))

# Build the dataframe in one go (DataFrame.append was removed in pandas 2.0)
neigh_df = pd.DataFrame(neigh_rows, columns=['Neighborhood', 'Latitude', 'Longitude'])

The geographical coordinates of Parkwoods are 43.7587999, -79.3201966.
The geographical coordinates of Victoria Village are 43.732658, -79.3111892.
The geographical coordinates of Regent Park are 43.6607056, -79.3604569.
The geographical coordinates of Harbourfront are 43.6400801, -79.3801495.
The geographical coordinates of Lawrence Manor are 43.7220788, -79.4375067.
The geographical coordinates of Lawrence Heights are 43.7227784, -79.4509332.
The geographical coordinates of Ontario Provincial Government are nan, nan.
The geographical coordinates of Islington Avenue are 43.6389593, -79.5210499.
The geographical coordinates of Malvern are 43.8091955, -79.2217008.
The geographical coordinates of Rouge are 43.8049304, -79.1658374.
The geographical coordinates of Don Mills North are 43.775347, -79.3459439.
The geographical coordinates of Parkview Hill are 43.7062977, -79.3219073.
The geographical coordinates of Woodbine Gardens are 43.7120785, -79.3025673.
The geographical coordinates of Garden District are 43.65649

In [95]:
neigh_df.loc[neigh_df['Latitude'].isna()]

Unnamed: 0,Neighborhood,Latitude,Longitude
6,Ontario Provincial Government,,
37,Caledonia-Fairbanks,,
102,Keelsdale and Silverthorn,,
128,North Midtown,,
132,Enclave of L4W,,
169,Humber Bay Shores,,
175,Beaumond Heights,,
202,Enclave of M4L,,


The geocoding lookup failed on the 8 neighborhoods above, I assume because their addresses don't fit the same pattern as the rest (i.e. ```'{Neighborhood}, Toronto, ON, Canada'```), or because they are placeholder names in the original table. Rather than try to find the correct address for each of them, I decided to simply drop them. While I was at it, I dropped two duplicate rows as well.

In [96]:
print(neigh_df.shape)
neigh_df.dropna(inplace=True)
print(neigh_df.shape)
neigh_df.drop_duplicates(inplace=True)
print(neigh_df.shape)

(216, 3)
(208, 3)
(206, 3)


### Plotting the map of the neighborhoods using the newly created dataframe. 

In [97]:
# Create map of Toronto using latitude and longitude values
# Toronto is located at 43.6532° N, 79.3832° W according to a quick search
toronto_map = folium.Map(location=[43.6532, -79.3832], zoom_start=11)

# Add markers to map
# Code taken with slight alterations from lab 3-3-2. No need to reinvent the wheel
for lat, lng, neighborhood in zip(neigh_df['Latitude'], neigh_df['Longitude'], neigh_df['Neighborhood']):
    label = neighborhood
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(toronto_map)  
    
toronto_map

### Exploring nearby venues in Foursquare

*Hidden cell containing Foursquare credentials*

In [98]:
# The code was removed by Watson Studio for sharing.

Here I borrow the ```getNearbyVenues``` function from the 3-3-2 lab again. As before, no need to reinvent the wheel.  
It returns the top venues within 500 meters of each input neighborhood, up to 100 venues per neighborhood, based on Foursquare's recommendation system.  
It uses the Foursquare API *Get* method with the *Explore* endpoint.

In [99]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
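
The request URL above is assembled with one long `format` string. An alternative sketch, shown here only to illustrate the parameters involved, builds the query string from a dictionary with the standard library's `urlencode`. The credential values and the version date are placeholders I made up, and `build_explore_url` is a hypothetical helper, not part of the lab code.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the explore request URL from named parameters (hypothetical helper)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)


# Placeholder credentials and version string, for illustration only
url = build_explore_url('MY_ID', 'MY_SECRET', '20180605', 43.7588, -79.3202, radius=999)
print(url)
```

Note that `urlencode` percent-encodes the comma in `ll` as `%2C`, which is standard URL encoding for query values.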

Then run that function on the entire neighborhood dataframe

In [120]:
toronto_venues = getNearbyVenues(names=neigh_df['Neighborhood'], latitudes=neigh_df['Latitude'], longitudes=neigh_df['Longitude'])

In [151]:
toronto_venues

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.758800,-79.320197,Allwyn's Bakery,43.759840,-79.324719,Caribbean Restaurant
1,Parkwoods,43.758800,-79.320197,LCBO,43.757774,-79.314257,Liquor Store
2,Parkwoods,43.758800,-79.320197,Shoppers Drug Mart,43.760857,-79.324961,Pharmacy
3,Parkwoods,43.758800,-79.320197,Petro-Canada,43.757950,-79.315187,Gas Station
4,Parkwoods,43.758800,-79.320197,Pizza Pizza,43.760231,-79.325666,Pizza Place
...,...,...,...,...,...,...,...
5737,Royal York South West,43.648183,-79.511296,Tim Hortons,43.646678,-79.513700,Coffee Shop
5738,Royal York South West,43.648183,-79.511296,Rogers,43.647080,-79.511550,Mobile Phone Shop
5739,Royal York South West,43.648183,-79.511296,Kaos Music Centre,43.647005,-79.513145,Music Store
5740,Royal York South West,43.648183,-79.511296,Gabby's Grill & Taps,43.648452,-79.506482,Bar


In [75]:
venue_counts = toronto_venues.groupby('Neighborhood').count()[['Venue']]
# axis=1 is needed so that the column, not an index label, is renamed
venue_counts.rename({'Venue': 'Venue Count'}, axis=1, inplace=True)
venue_counts

In [150]:
print("Neighborhoods with more than 50 venues within 500 meters", venue_counts[venue_counts['Venue Count'] > 50].shape[0])
print("Neighborhoods with less than 50 venues within 500 meters", venue_counts[venue_counts['Venue Count'] < 50].shape[0])
print("Neighborhoods with less than 5 venues within 500 meters", venue_counts[venue_counts['Venue Count'] < 5].shape[0])

Neighborhoods with more than 50 venues within 500 meters 40
Neighborhoods with less than 50 venues within 500 meters 164
Neighborhoods with less than 5 venues within 500 meters 41


Unlike with the Manhattan dataframe, a large portion of our neighborhoods have very few venues within 500 meters. There are more neighborhoods with fewer than 5 venues in range (41) than with more than 50 (40). This may make clustering them the same way we did the New York data inconsistent, due to the large amount of variance.

To resolve this, I first decided to try extending the search radius to 1 kilometer (or rather, 999 meters, since the API call would fail on 1000), given Toronto's larger and more spread-out layout compared to New York.

In [100]:
toronto_venues_1k = getNearbyVenues(names=neigh_df['Neighborhood'], latitudes=neigh_df['Latitude'], longitudes=neigh_df['Longitude'], radius=999)

In [142]:
venue_counts_1k = toronto_venues_1k.groupby('Neighborhood').count()[['Venue']]
venue_counts_1k.rename({'Venue':'Venue Count'},axis=1,inplace=True)
venue_counts_1k

Unnamed: 0_level_0,Venue Count
Neighborhood,Unnamed: 1_level_1
Adelaide,100
Agincourt,38
Agincourt North,30
Albion Gardens,21
Alderwood,22
Bathurst Manor,18
Bathurst Quay,100
Bayview Village,49
Bedford Park,68
Berczy Park,100


In [143]:
print("Neighborhoods with 100 or more venues within 999 meters", venue_counts_1k[venue_counts_1k['Venue Count'] == 100].shape[0])
print("Neighborhoods with more than 50 venues within 999 meters", venue_counts_1k[venue_counts_1k['Venue Count'] > 50].shape[0])
print("Neighborhoods with less than 50 venues within 999 meters", venue_counts_1k[venue_counts_1k['Venue Count'] < 50].shape[0])
print("Neighborhoods with less than 20 venues within 999 meters", venue_counts_1k[venue_counts_1k['Venue Count'] < 20].shape[0])
print("Neighborhoods with less than 5 venues within 999 meters", venue_counts_1k[venue_counts_1k['Venue Count'] < 5].shape[0])

Neighborhoods with 100 or more venues within 999 meters 47
Neighborhoods with more than 50 venues within 999 meters 84
Neighborhoods with less than 50 venues within 999 meters 122
Neighborhoods with less than 20 venues within 999 meters 53
Neighborhoods with less than 5 venues within 999 meters 3


With a much more reasonable distribution, I decided to stick with this one. Still, calculating the top 10 most common venue types for neighborhoods with fewer than 20 venues doesn't seem very meaningful. I'm proceeding with the clustering using this data, but chose not to evaluate the top 10 for each neighborhood individually. When the clusters are finished later on, I'll compute the top 10 venues in each along with the average total venue count.

In [145]:
# one hot encoding
toronto_onehot = (pd.get_dummies(toronto_venues_1k[['Venue Category']], prefix="", prefix_sep=""))
# One of the venue categories is itself named 'Neighborhood', so get_dummies creates a column with that name somewhere in the middle of the pile. Dropping it and inserting the actual neighborhood names at the beginning was the easiest way to clean up the table.
toronto_onehot.drop('Neighborhood', axis=1, inplace=True)
toronto_onehot.insert(0,'Neighborhood', toronto_venues_1k['Neighborhood'])
toronto_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Amphitheater,Animal Shelter,Antique Shop,Aquarium,Arcade,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Dealership,Auto Garage,Automotive Shop,BBQ Joint,Baby Store,Badminton Court,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Baseball Stadium,Basketball Court,Basketball Stadium,Bath House,Beach,Beach Bar,Beer Bar,Beer Store,Belgian Restaurant,Big Box Store,Bike Shop,Bistro,Boat or Ferry,Bookstore,Botanical Garden,Boutique,Bowling Alley,Boxing Gym,...,Sporting Goods Shop,Sports Bar,Sports Club,Sri Lankan Restaurant,Stationery Store,Steakhouse,Storage Facility,Street Art,Supermarket,Supplement Shop,Sushi Restaurant,Swiss Restaurant,Syrian Restaurant,Taco Place,Tailor Shop,Taiwanese Restaurant,Tanning Salon,Tapas Restaurant,Tattoo Parlor,Tea Room,Tennis Court,Tennis Stadium,Tex-Mex Restaurant,Thai Restaurant,Theater,Theme Park,Theme Restaurant,Thrift / Vintage Store,Tibetan Restaurant,Tourist Information Center,Toy / Game Store,Track,Trail,Train Station,Tunnel,Turkish Restaurant,University,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Zoo Exhibit
0,Parkwoods,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Parkwoods,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Parkwoods,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


With the wide variation in the number of venues nearby, I felt that the total number of venues found would be a relevant feature to evaluate by, so I added that total in.

In [147]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped = toronto_grouped.merge(venue_counts_1k, how='left', on='Neighborhood')
toronto_grouped.head()

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Amphitheater,Animal Shelter,Antique Shop,Aquarium,Arcade,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Dealership,Auto Garage,Automotive Shop,BBQ Joint,Baby Store,Badminton Court,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Baseball Stadium,Basketball Court,Basketball Stadium,Bath House,Beach,Beach Bar,Beer Bar,Beer Store,Belgian Restaurant,Big Box Store,Bike Shop,Bistro,Boat or Ferry,Bookstore,Botanical Garden,Boutique,Bowling Alley,Boxing Gym,...,Sports Bar,Sports Club,Sri Lankan Restaurant,Stationery Store,Steakhouse,Storage Facility,Street Art,Supermarket,Supplement Shop,Sushi Restaurant,Swiss Restaurant,Syrian Restaurant,Taco Place,Tailor Shop,Taiwanese Restaurant,Tanning Salon,Tapas Restaurant,Tattoo Parlor,Tea Room,Tennis Court,Tennis Stadium,Tex-Mex Restaurant,Thai Restaurant,Theater,Theme Park,Theme Restaurant,Thrift / Vintage Store,Tibetan Restaurant,Tourist Information Center,Toy / Game Store,Track,Trail,Train Station,Tunnel,Turkish Restaurant,University,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Zoo Exhibit,Venue Count
0,Adelaide,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.02,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100
1,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.052632,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.026316,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.026316,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.026316,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,38
2,Agincourt North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,30
3,Albion Gardens,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.047619,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,21
4,Alderwood,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,22
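
Taking the mean of one-hot columns per neighborhood gives the share of that neighborhood's venues falling in each category, which is why the rows above contain values such as 0.01 for a neighborhood with 100 venues. A toy illustration with made-up venues and hypothetical neighborhoods `A` and `B`:

```python
import pandas as pd

# Made-up venues for two hypothetical neighborhoods
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Bar', 'Bar', 'Bakery', 'Park', 'Park', 'Park'],
})

# One column per category, one row per venue
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of each indicator column is the category's share of the neighborhood's venues
shares = onehot.groupby('Neighborhood').mean()
print(shares)
```

Because these shares always sum to 1 per row, they say nothing about how many venues a neighborhood has, which is the motivation for adding `Venue Count` as a separate feature.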


### Clustering

In [148]:
# I chose 7 clusters for my K value
kclusters = 7

# I drop the Neighborhood label and then standardize the features, to account for the different scale of the total Venue Count
toronto_grouped_clustering = scale(toronto_grouped.drop('Neighborhood', axis=1))

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=4).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([4, 2, 2, 0, 0, 0, 2, 0, 2, 4], dtype=int32)
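
Scaling matters here because the category columns are shares between 0 and 1 while `Venue Count` runs up to 100, so without it the count would dominate the distance calculation. As a rough illustration of what standardization does to each column (zero mean, unit variance), here is a standard-library sketch on made-up numbers; `standardize` is a hypothetical helper and not how scikit-learn implements `scale`.

```python
from statistics import mean, pstdev


def standardize(column):
    """Shift a column to zero mean and rescale it to unit (population) variance."""
    mu = mean(column)
    sigma = pstdev(column)
    return [(value - mu) / sigma for value in column]


# Made-up columns on very different scales
share = [0.0, 0.5, 1.0]
count = [10, 50, 90]

print(standardize(share))
print(standardize(count))
```

After standardization the two columns contain the same values, so neither outweighs the other when distances are computed.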

### Mapping the Clusters

In [149]:
# add clustering labels to the original neigh_df for mapping
map_df = neigh_df
map_df['Cluster Label'] = kmeans.labels_

map_df.head()

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Label
0,Parkwoods,43.7588,-79.320197,4
1,Victoria Village,43.732658,-79.311189,2
2,Regent Park,43.660706,-79.360457,2
3,Harbourfront,43.64008,-79.38015,0
4,Lawrence Manor,43.722079,-79.437507,0
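
One thing worth double-checking in the cell above: `kmeans.labels_` follows the row order of `toronto_grouped`, which `groupby` sorts alphabetically by neighborhood, while `neigh_df` keeps the order of the Wikipedia table. The five labels shown for Parkwoods through Lawrence Manor are the same values as `kmeans.labels_[0:5]`, which suggests the labels were attached by position rather than by name. A sketch of aligning by name instead, on made-up data (the frames and label values below are placeholders):

```python
import pandas as pd

# Made-up stand-in for neigh_df, in table order
neigh = pd.DataFrame({'Neighborhood': ['Parkwoods', 'Adelaide', 'Malvern'],
                      'Latitude': [43.76, 43.65, 43.81],
                      'Longitude': [-79.32, -79.38, -79.22]})

# Made-up stand-in for toronto_grouped, sorted alphabetically by groupby
grouped_names = ['Adelaide', 'Malvern', 'Parkwoods']
labels = [4, 2, 0]  # placeholder for kmeans.labels_, one per grouped row

# Pair each label with the neighborhood it was computed for, then join by name
label_df = pd.DataFrame({'Neighborhood': grouped_names, 'Cluster Label': labels})
mapped = neigh.merge(label_df, on='Neighborhood', how='inner')
print(mapped)
```

Building a new dataframe this way also leaves the original coordinates table unchanged, whereas `map_df = neigh_df` only creates a second name for the same object.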


With the neighborhoods all clustered, I map each neighborhood with folium using color to indicate the clusters, borrowing the mapping code from the lab once more.

In [135]:
# create map
clusters_map = folium.Map(location=[43.6532, -79.3832], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(map_df['Latitude'], map_df['Longitude'], map_df['Neighborhood'], map_df['Cluster Label']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(clusters_map)

clusters_map

At first glance, the clusters don't appear to have much in common in terms of geographic location. Out of curiosity, I plotted the geographic center of each cluster on the map as well.

In [136]:
cluster_centers = map_df.groupby('Cluster Label')[['Latitude','Longitude']].mean().reset_index()
cluster_centers

Unnamed: 0,Cluster Label,Latitude,Longitude
0,0,43.708989,-79.406957
1,1,43.680461,-79.505196
2,2,43.694276,-79.401217
3,3,43.726328,-79.415904
4,4,43.698451,-79.415555
5,5,43.749045,-79.289843
6,6,43.6901,-79.363592


In [137]:
# add the geographical average of each cluster's latitude and longitude to the map
markers_colors = []
for lat, lon, cluster in zip(cluster_centers['Latitude'], cluster_centers['Longitude'], cluster_centers['Cluster Label']):
    label = folium.Popup(' Cluster Center ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=15,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(clusters_map)

clusters_map

In [160]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [164]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
cluster_venues_sorted = pd.DataFrame(columns=columns)
cluster_venues_sorted['Neighborhood'] = map_df['Neighborhood']
cluster_venues_sorted['Cluster Label'] = map_df['Cluster Label']
for ind in np.arange(map_df.shape[0]):
    cluster_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.drop(['Venue Count'], axis=1).iloc[ind, :], num_top_venues)

cluster_venues_sorted['Venue Count'] = toronto_grouped['Venue Count']
cluster_venues_sorted.head()

ValueError: Must have equal len keys and value when setting with an iterable
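
The error above is consistent with a column-count mismatch: by the time the loop runs, `cluster_venues_sorted` has the `Neighborhood` column, ten venue columns and `Cluster Label`, so `.iloc[ind, 1:]` selects eleven cells while `return_most_common_venues` returns ten values. One possible fix, sketched below on made-up data, is to build the venue columns on their own and attach any extra columns afterwards. The names `freq`, `top` and `Top N`, and the toy values, are placeholders. Note also that the rows of `toronto_grouped` are in alphabetical order, so the neighborhood names should come from it rather than from `map_df`.

```python
import pandas as pd

# Made-up category shares for two hypothetical neighborhoods
freq = pd.DataFrame({'Neighborhood': ['A', 'B'],
                     'Bar': [0.5, 0.1],
                     'Bakery': [0.3, 0.0],
                     'Park': [0.2, 0.9]})

num_top_venues = 2
venue_columns = ['Top {}'.format(i + 1) for i in range(num_top_venues)]

# For each row, sort the category shares and keep the names of the largest ones
ranked = [
    row.iloc[1:].astype(float).sort_values(ascending=False).index[:num_top_venues].tolist()
    for _, row in freq.iterrows()
]

# Put the names and the ranked categories side by side, sharing the same row index
ranked_df = pd.DataFrame(ranked, columns=venue_columns, index=freq.index)
top = pd.concat([freq[['Neighborhood']], ranked_df], axis=1)
print(top)
```

Cluster labels and venue counts can then be merged onto `top` by neighborhood name, which avoids relying on row position.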