<h1> Postal Codes in Canada </h1>

<h2> Part 1 </h2>

Firstly, import all the necessary libraries. 

In [89]:
# import all necessary libraries
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

Now I can set url and request the html text from wikipedia. Then I will print the html to see what it looks like. 


In [90]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

html = requests.get(url).text
#html

We want to parse this mess to find the tables in the website. That's where our data will be. 

In [91]:
# parse the data and find_all of the tables on the website
soup = BeautifulSoup(html, 'html5lib')
tables = soup.find_all('table')
len(tables)

3

It appears there are a few tables on the webite, so we'll have to find the correct table index to get our info. 

In [92]:
# since there are 3 tables, find out which table the data is in
for index, table in enumerate(tables):
    if ("wikitable" in str(table)):
        table_index = index
print(table_index)

0


Now, we can get all the info needed and put it into a dataframe. 
First, make an empty dataframe named <code> postal_codes </code> 

Then, find all the <code> tr </code> and the <code> td </code> within the HTML, strip the necessary parts and pop them into the dataframe. 

In [93]:
# place all of the data into a dataframe

postal_codes = pd.DataFrame(columns=["PostalCode", "Borough", "Neighborhood"])

for row in tables[table_index].tbody.find_all("tr"): 
    col = row.find_all("td")
    if (col != []):
        code = col[0].text.strip()
        borough = col[1].text.strip()
        neighborhood = col[2].text.strip()
        postal_codes = postal_codes.append({"PostalCode":code, "Borough":borough, "Neighborhood":neighborhood}, ignore_index=True)

postal_codes

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


Save only the necessary data from the Borough, and lose the data that is not assigned or we can use

In [94]:
# get ride of all the Boroughs that are not assigned
postal_codes = postal_codes[postal_codes.Borough != 'Not assigned']

In [95]:
# reset the index 
postal_codes = postal_codes.reset_index(drop=True)
postal_codes.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [105]:
postal_codes = postal_codes.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(list)

postal_codes = postal_codes.sample(frac=1).reset_index()
postal_codes['Neighborhood'] = postal_codes['Neighborhood'].str.join(',')
postal_codes.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M9L,North York,Humber Summit
1,M2J,North York,"Fairview, Henry Farm, Oriole"
2,M6P,West Toronto,"High Park, The Junction South"
3,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
4,M1B,Scarborough,"Malvern, Rouge"


In [106]:
postal_codes.shape

(103, 3)

That's about it. Now we have some data that we can sort through and actually use. 

<h2> Part 2 </h2>

In [71]:
pip install geocoder

You should consider upgrading via the '/Library/Frameworks/Python.framework/Versions/3.7/bin/python3.7 -m pip install --upgrade pip' command.[0m
Note: you may need to restart the kernel to use updated packages.


In [72]:
'''
import geocoder # import geocoder

# initialize your variable to None
lat_lng_coords = None

# loop until you get the coordinates
while(lat_lng_coords is None):
  g = geocoder.google('{}, Toronto, Ontario'.format(postal_codes))
  lat_lng_coords = g.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]

'''

"\nimport geocoder # import geocoder\n\n# initialize your variable to None\nlat_lng_coords = None\n\n# loop until you get the coordinates\nwhile(lat_lng_coords is None):\n  g = geocoder.google('{}, Toronto, Ontario'.format(postal_codes))\n  lat_lng_coords = g.latlng\n\nlatitude = lat_lng_coords[0]\nlongitude = lat_lng_coords[1]\n\n"

In [107]:
csv = 'Geospatial_Coordinates.csv'
lat_long = pd.read_csv('/Users/jacobgood/Desktop/MOOCs for Money/Coursera/5. Applied Data Science Capstone/Week 3 - k-Means/Final Project/Geospatial_Coordinates.csv')

Let's take a look at the datasets and them merge them together. 

In [108]:
lat_long.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [109]:
postal_codes.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M9L,North York,Humber Summit
1,M2J,North York,"Fairview, Henry Farm, Oriole"
2,M6P,West Toronto,"High Park, The Junction South"
3,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
4,M1B,Scarborough,"Malvern, Rouge"


In [110]:
lat_long.shape, postal_codes.shape

((103, 3), (103, 3))

Great! They're the same shape. This will be easy. Let's make sure the Postal Codes are sorted the same. 

In [111]:
lat_long = lat_long.sort_values(by=['Postal Code'])
lat_long

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [112]:
postal_codes = postal_codes.sort_values(by=['PostalCode'])
postal_codes

Unnamed: 0,PostalCode,Borough,Neighborhood
4,M1B,Scarborough,"Malvern, Rouge"
20,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
92,M1E,Scarborough,"Guildwood, Morningside, West Hill"
70,M1G,Scarborough,Woburn
37,M1H,Scarborough,Cedarbrae
...,...,...,...
87,M9N,York,Weston
22,M9P,Etobicoke,Westmount
54,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
56,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."


They all match up on this page, should be fine to merge now! First we'll match the latitudes and then match the longitudes. Check the first 15 to see how we did!

In [113]:
postal_codes['Latitude'] = lat_long['Latitude']
postal_codes['Longitude'] = lat_long['Longitude']


In [114]:
postal_codes

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
4,M1B,Scarborough,"Malvern, Rouge",43.773136,-79.239476
20,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.757490,-79.374714
92,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.628841,-79.520999
70,M1G,Scarborough,Woburn,43.648429,-79.382280
37,M1H,Scarborough,Cedarbrae,43.676357,-79.293031
...,...,...,...,...,...
87,M9N,York,Weston,43.662744,-79.321558
22,M9P,Etobicoke,Westmount,43.770120,-79.408493
54,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.657162,-79.378937
56,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.644771,-79.373306


# Part 3

Import libraries that we need for this part of the project. 

In [151]:
from sklearn.cluster import KMeans
import folium
from geopy.geocoders import Nominatim

<h3> Cluster the Neighborhoods </h3>

In [115]:
# let's rename the postal_codes to toronto_data
toronto_data = postal_codes
toronto_data

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
4,M1B,Scarborough,"Malvern, Rouge",43.773136,-79.239476
20,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.757490,-79.374714
92,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.628841,-79.520999
70,M1G,Scarborough,Woburn,43.648429,-79.382280
37,M1H,Scarborough,Cedarbrae,43.676357,-79.293031
...,...,...,...,...,...
87,M9N,York,Weston,43.662744,-79.321558
22,M9P,Etobicoke,Westmount,43.770120,-79.408493
54,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.657162,-79.378937
56,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.644771,-79.373306


Let's use the geocoder to get the starting location for Toronto, Canada. Save the latitude and longitude for later. 

In [116]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent='tor_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


We can visualize the neighborhoods in Toronto now that we've got the Latitude and Longitude. 

In [117]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Borough'], toronto_data['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Let's just review a bit of what we're looking at. This is the data that we've identified in the map. Postal Codes, Borough, Neighborhood, Latitude, and Longitude. 

In [118]:
toronto_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
4,M1B,Scarborough,"Malvern, Rouge",43.773136,-79.239476
20,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.75749,-79.374714
92,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.628841,-79.520999
70,M1G,Scarborough,Woburn,43.648429,-79.38228
37,M1H,Scarborough,Cedarbrae,43.676357,-79.293031


Here I can use the Foursquare API to explore the neighborhoods and see what each has to offer. We'll also segment them. 

#### First let's save some data

In [119]:
CLIENT_ID = '5OMXBU3ZGRZA5BCARJRRJ12VXI4MNWEYKWMRJYM4SOOTUODX' # your Foursquare ID
CLIENT_SECRET = 'TRRWL1N24Z54AUJQZZVJESAZGBVNOD5YSD3HOKMIQLBOXWKQ' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: 5OMXBU3ZGRZA5BCARJRRJ12VXI4MNWEYKWMRJYM4SOOTUODX
CLIENT_SECRET:TRRWL1N24Z54AUJQZZVJESAZGBVNOD5YSD3HOKMIQLBOXWKQ


#### Now let's explore the neighborhoods in the dataframe

In [120]:
toronto_data.loc[0, 'Neighborhood']

'Humber Summit'

Get the neighborhood's latitude and longtitude

In [121]:
neighborhood_latitude = toronto_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = toronto_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = toronto_data.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Humber Summit are 43.806686299999996, -79.19435340000001.


#### Top 100 VENUES in the Humber Summit area within a radius of 500 meters

In [122]:
# we'll limit the venues to 100 and get a radius of 500 m

LIMIT = 100
radius = 500

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)

In [123]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '604cb3595c9e6424598c3df2'},
  'headerLocation': 'Humber Summit',
  'headerFullLocation': 'Humber Summit, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 3,
  'suggestedBounds': {'ne': {'lat': 43.7608033045, 'lng': -79.55974472346225},
   'sw': {'lat': 43.7518032955, 'lng': -79.57218187653774}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '54fb21be498e513a0a853128',
       'name': 'HNS ARARAT',
       'location': {'address': '153 Milvan Dr #2B',
        'lat': 43.75751884568645,
        'lng': -79.5636534690857,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.75751884568645,
          'lng': -79.5636534690857}],
        'distance': 229,
        'postalCode': 'M9L 1Z8',
        'cc': 'CA

I'm borrowing a get_category_type here from our labs in the previous section while using Foursquare API. 

In [124]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Clean the json we just downloaded and then pop it into a pandas dataframe

In [125]:
# I had to specifically import the json_normalize for some reason
from pandas.io.json import json_normalize

In [126]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,HNS ARARAT,Furniture / Home Store,43.757519,-79.563653
1,Islington & Finch,Intersection,43.754646,-79.568638
2,Dryshield Waterproofing,Construction & Landscaping,43.758994,-79.561762


In [127]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

3 venues were returned by Foursquare.


That doesn't look like a promising figure for entertainment. We'll just get all the neighborhoods and venues in those neighborhoods. This will be a bunch. 

In [128]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [130]:
toronto_venues = getNearbyVenues(names=toronto_data['Neighborhood'], 
                                  latitudes=toronto_data['Latitude'], 
                                  longitudes=toronto_data['Longitude'])

Malvern, Rouge
Rouge Hill, Port Union, Highland Creek
Guildwood, Morningside, West Hill
Woburn
Cedarbrae
Scarborough Village
Kennedy Park, Ionview, East Birchmount Park
Golden Mile, Clairlea, Oakridge
Cliffside, Cliffcrest, Scarborough Village West
Birch Cliff, Cliffside West
Dorset Park, Wexford Heights, Scarborough Town Centre
Wexford, Maryvale
Agincourt
Clarks Corners, Tam O'Shanter, Sullivan
Milliken, Agincourt North, Steeles East, L'Amoreaux East
Steeles West, L'Amoreaux West
Upper Rouge
Hillcrest Village
Fairview, Henry Farm, Oriole
Bayview Village
York Mills, Silver Hills
Willowdale, Newtonbrook
Willowdale, Willowdale East
York Mills West
Willowdale, Willowdale West
Parkwoods
Don Mills
Don Mills
Bathurst Manor, Wilson Heights, Downsview North
Northwood Park, York University
Downsview
Downsview
Downsview
Downsview
Victoria Village
Parkview Hill, Woodbine Gardens
Woodbine Heights
The Beaches
Leaside
Thorncliffe Park
East Toronto, Broadview North (Old East York)
The Danforth West, 

In [131]:
toronto_data['Neighborhood'].value_counts()

Downsview                                                                                                                                 4
Don Mills                                                                                                                                 2
Parkview Hill, Woodbine Gardens                                                                                                           1
Old Mill South, King's Mill Park, Sunnylea, Humber Bay, Mimico NE, The Queensway East, Royal York South East, Kingsway Park South East    1
Studio District                                                                                                                           1
                                                                                                                                         ..
Fairview, Henry Farm, Oriole                                                                                                              1
Davisville          

How big is the dataframe with all of the Toronto venues? Let's take a look. 

In [132]:
print(toronto_venues.shape)
toronto_venues.head()

(2124, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Malvern, Rouge",43.773136,-79.239476,Federick Restaurant,43.774697,-79.241142,Hakka Restaurant
1,"Malvern, Rouge",43.773136,-79.239476,Drupati's Roti & Doubles,43.775222,-79.241678,Caribbean Restaurant
2,"Malvern, Rouge",43.773136,-79.239476,Thai One On,43.774468,-79.241268,Thai Restaurant
3,"Malvern, Rouge",43.773136,-79.239476,Centennial Recreation Centre,43.774593,-79.2365,Athletics & Sports
4,"Malvern, Rouge",43.773136,-79.239476,TD Canada Trust,43.77483,-79.241251,Bank


We'll look at more details of the various neighborhoods to see how many venues were returend for each neighborhood. 

In [133]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Agincourt,100,100,100,100,100,100
"Alderwood, Long Branch",31,31,31,31,31,31
"Bathurst Manor, Wilson Heights, Downsview North",10,10,10,10,10,10
Bayview Village,14,14,14,14,14,14
"Bedford Park, Lawrence Manor East",3,3,3,3,3,3
...,...,...,...,...,...,...
"Willowdale, Willowdale West",4,4,4,4,4,4
Woburn,100,100,100,100,100,100
Woodbine Heights,2,2,2,2,2,2
York Mills West,9,9,9,9,9,9


#### What about the unique categories that can be curated from the venue list? 

In [134]:
print('There are {} uniques categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 274 uniques categories.


# Analyze the neighborhoods

In [136]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Adult Boutique,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Trail,Train Station,Truck Stop,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Check the size of the new dataframe. 

In [138]:
toronto_onehot.shape

(2124, 274)

#### Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category

In [140]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Adult Boutique,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Trail,Train Station,Truck Stop,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store
0,Agincourt,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,...,0.0,0.00,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.0
1,"Alderwood, Long Branch",0.032258,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.0,0.00,0.0,0.00,0.0,0.0,0.0,0.00,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.0,0.00,0.0,0.00,0.0,0.0,0.0,0.00,0.0,0.0
3,Bayview Village,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.0,0.00,0.0,0.00,0.0,0.0,0.0,0.00,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.0,0.00,0.0,0.00,0.0,0.0,0.0,0.00,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91,"Willowdale, Willowdale West",0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.0,0.00,0.0,0.00,0.0,0.0,0.0,0.00,0.0,0.0
92,Woburn,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,...,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0
93,Woodbine Heights,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.0,0.00,0.0,0.00,0.0,0.0,0.0,0.00,0.0,0.0
94,York Mills West,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,...,0.0,0.00,0.0,0.00,0.0,0.0,0.0,0.00,0.0,0.0


#### Confirm the new size

In [143]:
toronto_grouped.shape

(96, 274)

#### Print out each neighborhood with the top 5 most common venue

In [144]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt----
         venue  freq
0  Coffee Shop  0.13
1   Restaurant  0.07
2         Café  0.06
3        Hotel  0.05
4          Gym  0.04


----Alderwood, Long Branch----
                venue  freq
0         Coffee Shop  0.19
1               Diner  0.06
2    Sushi Restaurant  0.06
3         Yoga Studio  0.03
4  Italian Restaurant  0.03


----Bathurst Manor, Wilson Heights, Downsview North----
            venue  freq
0     Pizza Place   0.2
1             Pub   0.1
2        Pharmacy   0.1
3  Sandwich Place   0.1
4             Gym   0.1


----Bayview Village----
                         venue  freq
0               Breakfast Spot  0.14
1                    Gift Shop  0.14
2                  Coffee Shop  0.07
3  Eastern European Restaurant  0.07
4                          Bar  0.07


----Bedford Park, Lawrence Manor East----
                        venue  freq
0                        Park  0.33
1  Construction & Landscaping  0.33
2           Convenience Store  0.33
3                

4        Yoga Studio  0.00


----Queen's Park, Ontario Provincial Government----
                 venue  freq
0                 Café  0.25
1  Japanese Restaurant  0.25
2                 Bank  0.25
3   Chinese Restaurant  0.25
4  Monument / Landmark  0.00


----Regent Park, Harbourfront----
                venue  freq
0         Coffee Shop  0.18
1                Café  0.05
2  Italian Restaurant  0.05
3      Sandwich Place  0.05
4         Salad Place  0.03


----Richmond, Adelaide, King----
                 venue  freq
0          Coffee Shop  0.08
1  Japanese Restaurant  0.06
2     Sushi Restaurant  0.06
3              Gay Bar  0.04
4           Restaurant  0.04


----Rosedale----
                        venue  freq
0                        Park   1.0
1                 Yoga Studio   0.0
2                 Men's Store   0.0
3  Modern European Restaurant   0.0
4           Mobile Phone Shop   0.0


----Roselawn----
            venue  freq
0     Coffee Shop  0.11
1            Café  0.05
2     

#### Pass that info into a pandas dataframe 

First, let's write a function to sort the venues in descending order. 

In [145]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now the new dataframe and with it we'll display the top 10 venues for each neighborhood. 

In [146]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Coffee Shop,Restaurant,Café,Hotel,Italian Restaurant,Gym,Cocktail Bar,Seafood Restaurant,Japanese Restaurant,Deli / Bodega
1,"Alderwood, Long Branch",Coffee Shop,Diner,Sushi Restaurant,Yoga Studio,Bar,Beer Bar,Smoothie Shop,Sandwich Place,Burrito Place,Restaurant
2,"Bathurst Manor, Wilson Heights, Downsview North",Pizza Place,Sandwich Place,Skating Rink,Gym,Coffee Shop,Pharmacy,Dance Studio,Athletics & Sports,Pub,Women's Store
3,Bayview Village,Breakfast Spot,Gift Shop,Dog Run,Bar,Restaurant,Bookstore,Dessert Shop,Coffee Shop,Eastern European Restaurant,Cuban Restaurant
4,"Bedford Park, Lawrence Manor East",Park,Construction & Landscaping,Convenience Store,Dumpling Restaurant,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore


# Here, we're going to CLUSTER the NEIGHBORHOODS

Run k-means to cluest the neighborhood into 5 clusters

In [166]:
toronto_grouped.head(50)

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Adult Boutique,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Trail,Train Station,Truck Stop,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,...,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.0
1,"Alderwood, Long Branch",0.032258,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,"Birch Cliff, Cliffside West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015873,...,0.0,0.0,0.0,0.0,0.015873,0.0,0.0,0.0,0.0,0.031746
8,"Business reply mail Processing Centre, South C...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [173]:
# set number of clusters
kclusters = 3

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 2, 0, 0, 0, 2, 0], dtype=int32)

In [178]:
# add clustering labels
#neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_data

# merge manhattan_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood',how ='inner')

toronto_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,M1B,Scarborough,"Malvern, Rouge",43.773136,-79.239476,0,Bakery,Bank,Hakka Restaurant,Caribbean Restaurant,Athletics & Sports,Fried Chicken Joint,Thai Restaurant,Gas Station,Doner Restaurant,Distribution Center
92,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.628841,-79.520999,0,Gym,Kids Store,Discount Store,Tanning Salon,Burrito Place,Sandwich Place,Burger Joint,Bakery,Supplement Shop,Convenience Store
70,M1G,Scarborough,Woburn,43.648429,-79.38228,0,Coffee Shop,Café,Hotel,Restaurant,Japanese Restaurant,Gym,Steakhouse,Salad Place,American Restaurant,Asian Restaurant
37,M1H,Scarborough,Cedarbrae,43.676357,-79.293031,0,Trail,Health Food Store,Pub,Drugstore,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop
65,M1J,Scarborough,Scarborough Village,43.67271,-79.405678,0,Coffee Shop,Café,Sandwich Place,Park,BBQ Joint,History Museum,Indian Restaurant,Liquor Store,Donut Shop,Middle Eastern Restaurant


In [196]:
toronto_merged['Cluster Labels'].value_counts()

0    91
1     5
2     2
4     1
3     1
Name: Cluster Labels, dtype: int64

Let's visualize this stuff man! Come on! But first, make sure you have the necessary libraries. 

In [183]:
import matplotlib.cm as cm
import matplotlib.colors as colors

In [200]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [197]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Scarborough,0,Bakery,Bank,Hakka Restaurant,Caribbean Restaurant,Athletics & Sports,Fried Chicken Joint,Thai Restaurant,Gas Station,Doner Restaurant,Distribution Center
92,Scarborough,0,Gym,Kids Store,Discount Store,Tanning Salon,Burrito Place,Sandwich Place,Burger Joint,Bakery,Supplement Shop,Convenience Store
70,Scarborough,0,Coffee Shop,Café,Hotel,Restaurant,Japanese Restaurant,Gym,Steakhouse,Salad Place,American Restaurant,Asian Restaurant
37,Scarborough,0,Trail,Health Food Store,Pub,Drugstore,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop
65,Scarborough,0,Coffee Shop,Café,Sandwich Place,Park,BBQ Joint,History Museum,Indian Restaurant,Liquor Store,Donut Shop,Middle Eastern Restaurant
...,...,...,...,...,...,...,...,...,...,...,...,...
87,York,0,Yoga Studio,Auto Workshop,Gym / Fitness Center,Garden Center,Garden,Fast Food Restaurant,Farmers Market,Light Rail Station,Comic Shop,Pizza Place
22,Etobicoke,0,Ramen Restaurant,Coffee Shop,Pizza Place,Sandwich Place,Café,Restaurant,Plaza,Shopping Mall,Hotel,Steakhouse
54,Etobicoke,0,Clothing Store,Coffee Shop,Café,Cosmetics Shop,Italian Restaurant,Bubble Tea Shop,Hotel,Middle Eastern Restaurant,Japanese Restaurant,Fast Food Restaurant
56,Etobicoke,0,Coffee Shop,Cocktail Bar,Bakery,Seafood Restaurant,Beer Bar,Restaurant,Farmers Market,Cheese Shop,Pharmacy,Sandwich Place


#### Cluster 2

In [198]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
74,Scarborough,1,Park,Pool,Women's Store,Greek Restaurant,Deli / Bodega,Escape Room,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore
21,Downtown Toronto,1,Park,Drugstore,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,College Stadium
50,Downtown Toronto,1,Park,Playground,Trail,Ethiopian Restaurant,Escape Room,Electronics Store,Event Space,Eastern European Restaurant,Dumpling Restaurant,Department Store
40,Downtown Toronto,1,Park,Convenience Store,Dumpling Restaurant,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Eastern European Restaurant
98,Etobicoke,1,Park,Drugstore,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,College Stadium


#### Cluster 3

In [201]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
91,Downtown Toronto,2,Baseball Field,Women's Store,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dumpling Restaurant,Fast Food Restaurant
97,Central Toronto,2,Paper / Office Supplies Store,Baseball Field,Women's Store,Drugstore,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant


#### Cluster 4

In [202]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,North York,3,Bar,Women's Store,Drugstore,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Fast Food Restaurant


#### Cluster 5

In [203]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,4,Fast Food Restaurant,Drugstore,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Hakka Restaurant
