**Building a dataframe**

Source: https://www.coursera.org/learn/applied-data-science-capstone/peer/I1bDq/segmenting-and-clustering-neighborhoods-in-toronto/submit

For this assignment, you will be required to explore and cluster the neighborhoods in Toronto.

1. Start by creating a new Notebook for this assignment.
2. Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe.

3. To create the dataframe:

The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood

4. Submit a link to your Notebook on your Github repository. (10 marks)

**Note**: There are different website scraping libraries and packages in Python. For scraping the above table, you can simply use pandas to read the table into a pandas dataframe.

Another way, which would help you learn to handle more complicated cases of web scraping, is to use the BeautifulSoup package. Here is the package's main documentation page: http://beautiful-soup-4.readthedocs.io/en/latest/

The package is so popular that there is a plethora of tutorials and examples on how to use it. Here is a very good YouTube video on how to use the BeautifulSoup package: https://www.youtube.com/watch?v=ng2o98k983k

Use pandas, or the BeautifulSoup package, or any other way you are comfortable with to transform the data in the table on the Wikipedia page into the above pandas dataframe.
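As an illustration of the "any other way" option, here is a minimal sketch that pulls the rows out of an HTML table using only the Python standard library. The `SAMPLE_HTML` string, the `TableParser` class, and the `parse_table` helper are assumptions made for this example; the real Wikipedia page is larger and may be structured differently, and BeautifulSoup would do the same job with less code.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the Wikipedia table; the real page has many more rows.
SAMPLE_HTML = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, one list per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Only keep text that sits inside a table cell
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append([cell.strip() for cell in self._row])
            self._row = None


def parse_table(html):
    """Return (header, records) for the first table rows found in html."""
    parser = TableParser()
    parser.feed(html)
    header, *records = parser.rows
    return header, records


header, records = parse_table(SAMPLE_HTML)
print(header)
print(records)
```

Passing the result to `pd.DataFrame(records, columns=header)` would give a dataframe with the same three columns that `pd.read_html` returns.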




In [2]:
import os
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")

# Load the dataset

In [3]:
file_input_path = "./data/Toronto_FSA.csv"
df = None

if os.path.exists(file_input_path):    
    print("Loading from saved csv '%s' that was downloaded from Wikipedia page" % file_input_path)
    df = pd.read_csv(file_input_path, header=0)
else:
    print("Loading table Toronto FSA from Wikipedia page")
    
    url_page = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
    tables = pd.read_html(url_page)
    
    print("The number of tables in given Wikipedia page : %s" % len(tables))    
    df = tables[0]
    
    print("Save to file csv: %s" % file_input_path)
    df.to_csv(file_input_path, header=True, index=False)

Loading from saved csv './data/Toronto_FSA.csv' that was downloaded from Wikipedia page


# Process the dataframe

In [9]:
# use the .shape attribute to print the number of rows and columns of the dataframe
print("(row, column) = %s" % str(df.shape))

(row, column) = (287, 3)


In [5]:
df.info() # The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 287 entries, 0 to 286
Data columns (total 3 columns):
Postcode         287 non-null object
Borough          287 non-null object
Neighbourhood    287 non-null object
dtypes: object(3)
memory usage: 6.8+ KB


In [6]:
df.describe()

Unnamed: 0,Postcode,Borough,Neighbourhood
count,287,287,287
unique,180,12,208
top,M9V,Not assigned,Not assigned
freq,8,77,78


In [7]:
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [8]:
df.columns

Index(['Postcode', 'Borough', 'Neighbourhood'], dtype='object')

Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned. More than one neighborhood can exist in one postal code area. 

In [10]:
COL_NAME_POSTCODE = "Postcode"
COL_NAME_BOROUGH = "Borough"
COL_NAME_NEIGHBOURHOOD = "Neighbourhood"

CONST_NOT_ASSIGNED = "Not assigned"

In [11]:
df = df[df[COL_NAME_BOROUGH] != CONST_NOT_ASSIGNED]

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 210 entries, 2 to 285
Data columns (total 3 columns):
Postcode         210 non-null object
Borough          210 non-null object
Neighbourhood    210 non-null object
dtypes: object(3)
memory usage: 6.6+ KB


In [15]:
print("(row, column) = %s" % str(df.shape))
# before: (row, column) = (287, 3)
# after: (row, column) = (210, 3)

(row, column) = (210, 3)


In [16]:
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


Combining the neighborhoods that have the same postcode

In [17]:
df_combine = df.groupby(by=[COL_NAME_POSTCODE, 
                            COL_NAME_BOROUGH]).agg(lambda x: ','.join(x)).reset_index()

In [25]:
df_combine.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


For example, in the table on the Wikipedia page, you will notice that M5B is listed twice and has two neighborhoods: Ryerson and Garden District. These two rows will be combined into one row with the neighborhoods separated with a comma. 

In [35]:
df_combine[(df_combine[COL_NAME_POSTCODE]=="M5B")]

Unnamed: 0,Postcode,Borough,Neighbourhood
54,M5B,Downtown Toronto,"Ryerson,Garden District"


If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [36]:
df_combine_filter = df_combine[df_combine[COL_NAME_NEIGHBOURHOOD]==CONST_NOT_ASSIGNED]
df_combine_filter

Unnamed: 0,Postcode,Borough,Neighbourhood
85,M7A,Queen's Park,Not assigned


In [37]:
df_combine[COL_NAME_NEIGHBOURHOOD] = df_combine.apply(
    lambda row: row[COL_NAME_BOROUGH] 
        if row[COL_NAME_NEIGHBOURHOOD]==CONST_NOT_ASSIGNED 
        else row[COL_NAME_NEIGHBOURHOOD], axis=1)  # axis=1: apply the function to each row

In [39]:
df_combine[(df_combine[COL_NAME_POSTCODE]=="M7A")]

Unnamed: 0,Postcode,Borough,Neighbourhood
85,M7A,Queen's Park,Queen's Park


In [40]:
df_combine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 3 columns):
Postcode         103 non-null object
Borough          103 non-null object
Neighbourhood    103 non-null object
dtypes: object(3)
memory usage: 2.5+ KB


In [45]:
# Get the shape of the dataframe
print("(row, column) = %s" % str(df_combine.shape))

(row, column) = (103, 3)


Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.

In an older version of this course, we were leveraging the Google Maps Geocoding API to get the latitude and the longitude coordinates of each neighborhood. However, recently Google started charging for their API: http://geoawesomeness.com/developers-up-in-arms-over-google-maps-api-insane-price-hike/, so we will use the Geocoder Python package instead: https://geocoder.readthedocs.io/index.html.

The problem with this package is that you sometimes have to be persistent to get the geographical coordinates of a given postal code. A call for a given postal code can return None, and the same call made again can return the coordinates. So, to make sure that you get the coordinates for all of our neighborhoods, you can run a while loop for each postal code. Taking postal code M5G as an example, your code would look something like this:
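A minimal sketch of such a retry loop follows. It does not call the Geocoder package: `lookup` is a hypothetical callable standing in for the real geocoding call, and `make_flaky_lookup` is a fake used only to show the behaviour. The attempt cap is an addition, so that a postal code that never resolves does not loop forever.

```python
import time


def lookup_with_retry(lookup, query, max_attempts=10, delay_seconds=0.0):
    """Call lookup(query) until it returns a non-None value or attempts run out.

    `lookup` is a hypothetical callable standing in for a geocoding call that
    returns a (latitude, longitude) pair, or None when no answer comes back.
    """
    for attempt in range(1, max_attempts + 1):
        coords = lookup(query)
        if coords is not None:
            return coords
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # pause before trying again
    return None


def make_flaky_lookup(failures_before_success, answer):
    """Build a fake lookup that returns None a fixed number of times, then `answer`."""
    state = {"calls": 0}

    def lookup(query):
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            return None
        return answer

    return lookup


# Placeholder coordinates, not the real location of M5G.
fake_lookup = make_flaky_lookup(failures_before_success=2, answer=(0.0, 0.0))
coords = lookup_with_retry(fake_lookup, "M5G, Toronto, Ontario")
print(coords)
```

This notebook sidesteps the issue by loading the coordinates from a saved CSV file instead.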

# Get the latitude and the longitude coordinates of each neighborhood

In [46]:
COL_NAME_POSTAL_CODE = "Postal Code"
COL_NAME_LATITUDE = "Latitude"
COL_NAME_LONGITUDE = "Longitude"

file_input_path = "./data/Geospatial_Coordinates.csv"

# Loading file csv
df_coordinates = pd.read_csv(file_input_path, header=0)

In [47]:
print("(row, column) = %s" % str(df_coordinates.shape))

(row, column) = (103, 3)


In [48]:
df_coordinates.head(3)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711


**Merging two dataframes**

In [49]:
df_combine.columns

Index(['Postcode', 'Borough', 'Neighbourhood'], dtype='object')

In [50]:
df_coordinates.columns

Index(['Postal Code', 'Latitude', 'Longitude'], dtype='object')

In [51]:
df_merged = pd.merge(df_combine, df_coordinates, 
                     left_on=COL_NAME_POSTCODE, right_on=COL_NAME_POSTAL_CODE,
                     how="inner")

In [52]:
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103 entries, 0 to 102
Data columns (total 6 columns):
Postcode         103 non-null object
Borough          103 non-null object
Neighbourhood    103 non-null object
Postal Code      103 non-null object
Latitude         103 non-null float64
Longitude        103 non-null float64
dtypes: float64(2), object(4)
memory usage: 5.6+ KB


In [53]:
df_merged.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Postal Code,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",M1B,43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",M1C,43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",M1E,43.763573,-79.188711
3,M1G,Scarborough,Woburn,M1G,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,M1H,43.773136,-79.239476


In [54]:
df_merged.columns

Index(['Postcode', 'Borough', 'Neighbourhood', 'Postal Code', 'Latitude',
       'Longitude'],
      dtype='object')

In [55]:
# Drop the duplicate postal code column
df_merged.drop([COL_NAME_POSTAL_CODE], axis=1, inplace=True)

In [56]:
df_merged.columns

Index(['Postcode', 'Borough', 'Neighbourhood', 'Latitude', 'Longitude'], dtype='object')

In [58]:
# Get the shape of the merged dataframe
print("(row, column) = %s" % str(df_merged.shape))

(row, column) = (103, 5)


# Explore and cluster the neighborhoods in Toronto

Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.

Just make sure:

* to add enough Markdown cells to explain what you decided to do and to report any observations you make.
* to generate maps to visualize your neighborhoods and how they cluster together.

Once you are happy with your analysis, submit a link to the new Notebook on your Github repository. (3 marks)

In [62]:
df_merged[COL_NAME_BOROUGH].unique()

array(['Scarborough', 'North York', 'East York', 'East Toronto',
       'Central Toronto', 'Downtown Toronto', 'York', 'West Toronto',
       "Queen's Park", 'Mississauga', 'Etobicoke'], dtype=object)

In [65]:
# Filter with Toronto
CONST_BOROUGH_TORONTO = "Toronto"
df = df_merged[df_merged[COL_NAME_BOROUGH].str.contains(CONST_BOROUGH_TORONTO, case=False, regex=False)]

In [60]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39 entries, 37 to 93
Data columns (total 5 columns):
Postcode         39 non-null object
Borough          39 non-null object
Neighbourhood    39 non-null object
Latitude         39 non-null float64
Longitude        39 non-null float64
dtypes: float64(2), object(3)
memory usage: 1.8+ KB


In [64]:
df # 39 records

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
45,M4P,Central Toronto,Davisville North,43.712751,-79.390197
46,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
47,M4S,Central Toronto,Davisville,43.704324,-79.38879
48,M4T,Central Toronto,"Moore Park,Summerhill East",43.689574,-79.38316
49,M4V,Central Toronto,"Deer Park,Forest Hill SE,Rathnelly,South Hill,...",43.686412,-79.400049


In [66]:
# List the distinct boroughs
df[COL_NAME_BOROUGH].unique()

array(['East Toronto', 'Central Toronto', 'Downtown Toronto',
       'West Toronto'], dtype=object)

print('The dataframe has {} distinct boroughs and {} neighborhoods.'.format(len(df[COL_NAME_BOROUGH].unique()),
      df.shape[0]))

# Use the geopy library to get the latitude and longitude values of Toronto

In [71]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

address = 'Toronto, CA'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


**Creating a map of Toronto with neighborhoods**

In [77]:
import folium

# create map of Toronto using latitude and longitude values
plan = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df[COL_NAME_LATITUDE], 
                                           df[COL_NAME_LONGITUDE], 
                                           df[COL_NAME_BOROUGH], 
                                           df[COL_NAME_NEIGHBOURHOOD]):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='yellow',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(plan)  
    
plan

**Creating a map of Central Toronto**

In [78]:
CONST_BOROUGH = "Central Toronto"
df_central_toronto = df[df[COL_NAME_BOROUGH]==CONST_BOROUGH].reset_index(drop=True)
df_central_toronto

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
1,M4P,Central Toronto,Davisville North,43.712751,-79.390197
2,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
3,M4S,Central Toronto,Davisville,43.704324,-79.38879
4,M4T,Central Toronto,"Moore Park,Summerhill East",43.689574,-79.38316
5,M4V,Central Toronto,"Deer Park,Forest Hill SE,Rathnelly,South Hill,...",43.686412,-79.400049
6,M5N,Central Toronto,Roselawn,43.711695,-79.416936
7,M5P,Central Toronto,"Forest Hill North,Forest Hill West",43.696948,-79.411307
8,M5R,Central Toronto,"The Annex,North Midtown,Yorkville",43.67271,-79.405678


In [80]:
# create map of Toronto using latitude and longitude values
plan = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_central_toronto[COL_NAME_LATITUDE], 
                                           df_central_toronto[COL_NAME_LONGITUDE], 
                                           df_central_toronto[COL_NAME_BOROUGH], 
                                           df_central_toronto[COL_NAME_NEIGHBOURHOOD]):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(plan)  
    
plan

# Define the Foursquare Credentials and Version

In [117]:
CLIENT_ID = 'X'     # Foursquare ID
CLIENT_SECRET = 'X' # Foursquare Secret
VERSION = '20180604'
LIMIT = 30
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: X
CLIENT_SECRET:X


In [82]:
df_central_toronto.loc[0, COL_NAME_NEIGHBOURHOOD]

'Lawrence Park'

In [83]:
# Get the neighborhood's latitude and longitude values
neighborhood_latitude = df_central_toronto.loc[0, COL_NAME_LATITUDE]   # neighborhood latitude value
neighborhood_longitude = df_central_toronto.loc[0, COL_NAME_LONGITUDE] # neighborhood longitude value

neighborhood_name = df_central_toronto.loc[0, COL_NAME_NEIGHBOURHOOD] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Lawrence Park are 43.7280205, -79.3887901.


Now, let's get the top 100 venues that are in Lawrence Park within a radius of 500 meters.

In [118]:
# First, let's create the GET request URL. Name your URL url.

LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=X&client_secret=X&v=20180604&ll=43.7280205,-79.3887901&radius=500&limit=100'
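The same query string can also be assembled from a dictionary, which keeps each parameter on its own line and leaves the escaping to the standard library. This is only a sketch: `build_explore_url` is a hypothetical helper, and the credentials are the masked placeholders shown in the notebook.

```python
from urllib.parse import urlencode

# Placeholder credentials, matching the masked values shown in the notebook.
CLIENT_ID = "X"
CLIENT_SECRET = "X"
VERSION = "20180604"


def build_explore_url(lat, lng, radius=500, limit=100):
    """Return the venues/explore URL for one coordinate pair."""
    params = {
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "v": VERSION,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in "ll" readable instead of encoding it as %2C
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params, safe=",")


url = build_explore_url(43.7280205, -79.3887901)
print(url)
```

`requests.get` also accepts such a dictionary through its `params` argument, so the URL does not have to be formatted by hand at all.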

In [89]:
# Send the GET request and examine the results
import requests # library to handle requests

results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5e60f405bae9a2001c830699'},
  'headerLocation': 'Toronto',
  'headerFullLocation': 'Toronto',
  'headerLocationGranularity': 'city',
  'totalResults': 3,
  'suggestedBounds': {'ne': {'lat': 43.7325205045, 'lng': -79.3825744605273},
   'sw': {'lat': 43.7235204955, 'lng': -79.3950057394727}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '50e6da19e4b0d8a78a0e9794',
       'name': 'Lawrence Park Ravine',
       'location': {'address': '3055 Yonge Street',
        'crossStreet': 'Lawrence Avenue East',
        'lat': 43.72696303913755,
        'lng': -79.39438246708775,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.72696303913755,
          'lng': -79.39438246708775}],
        'distance': 465,
        'cc': 'CA',
  

In [90]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [91]:
# Save to dataframe
from pandas.io.json import json_normalize # transform JSON into a pandas dataframe (pandas >= 1.0 also exposes this as pd.json_normalize)

venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Lawrence Park Ravine,Park,43.726963,-79.394382
1,Zodiac Swim School,Swim School,43.728532,-79.38286
2,TTC Bus #162 - Lawrence-Donway,Bus Line,43.728026,-79.382805


In [92]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

3 venues were returned by Foursquare.


# Explore all neighborhoods of Central Toronto

In [93]:
# Create a function to repeat the same process for all the neighborhoods
COL_NAME_VENUE = "Venue"
COL_NAME_CATEGORY = "Category"

COL_NAME_NEIGHBOURHOOD_LATITUDE = COL_NAME_NEIGHBOURHOOD + " " + COL_NAME_LATITUDE
COL_NAME_NEIGHBOURHOOD_LONGITUDE = COL_NAME_NEIGHBOURHOOD + " " + COL_NAME_LONGITUDE
COL_NAME_VENUE_LATITUDE = COL_NAME_VENUE + " " + COL_NAME_LATITUDE
COL_NAME_VENUE_LONGITUDE = COL_NAME_VENUE + " " + COL_NAME_LONGITUDE
COL_NAME_VENUE_CATEGORY = COL_NAME_VENUE + " " + COL_NAME_CATEGORY


def get_near_by_venues(names, latitudes, longitudes, radius=500):    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = [COL_NAME_NEIGHBOURHOOD, 
                             COL_NAME_NEIGHBOURHOOD_LATITUDE,
                             COL_NAME_NEIGHBOURHOOD_LONGITUDE,
                             COL_NAME_VENUE,
                             COL_NAME_VENUE_LATITUDE,
                             COL_NAME_VENUE_LONGITUDE,
                             COL_NAME_VENUE_CATEGORY]
    return(nearby_venues)
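One fragile spot in the function above is `v['venue']['categories'][0]['name']`, which raises an `IndexError` for any venue whose category list is empty. The sketch below shows one possible guard; `first_category_name` is a hypothetical helper, and the sample items are made up to mimic the shape of the response items.

```python
def first_category_name(venue, default=None):
    """Return the name of the venue's first category, or `default` if it has none."""
    categories = venue.get("categories") or []
    if not categories:
        return default
    return categories[0].get("name", default)


# Hypothetical items shaped like the entries in results['response']['groups'][0]['items'].
items = [
    {"venue": {"name": "Lawrence Park Ravine", "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Uncategorized Place", "categories": []}},
]

for item in items:
    print(item["venue"]["name"], "->", first_category_name(item["venue"]))
```

Calling `first_category_name(v['venue'])` inside the list comprehension would let the loop continue past such venues, at the cost of a missing value in the Venue Category column.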

In [94]:
# dataframe that contains all the neighborhoods of Central Toronto
venues_central_toronto = get_near_by_venues(
    names=df_central_toronto[COL_NAME_NEIGHBOURHOOD],
    latitudes=df_central_toronto[COL_NAME_LATITUDE],                           
    longitudes=df_central_toronto[COL_NAME_LONGITUDE])

Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park,Summerhill East
Deer Park,Forest Hill SE,Rathnelly,South Hill,Summerhill West
Roselawn
Forest Hill North,Forest Hill West
The Annex,North Midtown,Yorkville


In [95]:
# Get the shape of the dataframe
print("(row, column) = %s" % str(venues_central_toronto.shape))
venues_central_toronto.head()

(row, column) = (106, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Lawrence Park,43.72802,-79.38879,Lawrence Park Ravine,43.726963,-79.394382,Park
1,Lawrence Park,43.72802,-79.38879,Zodiac Swim School,43.728532,-79.38286,Swim School
2,Lawrence Park,43.72802,-79.38879,TTC Bus #162 - Lawrence-Donway,43.728026,-79.382805,Bus Line
3,Davisville North,43.712751,-79.390197,Summerhill Market North,43.715499,-79.392881,Food & Drink Shop
4,Davisville North,43.712751,-79.390197,Sherwood Park,43.716551,-79.387776,Park


In [96]:
# check how many venues were returned for each neighborhood
venues_central_toronto.groupby(COL_NAME_NEIGHBOURHOOD).count()

Unnamed: 0_level_0,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Davisville,35,35,35,35,35,35
Davisville North,8,8,8,8,8,8
"Deer Park,Forest Hill SE,Rathnelly,South Hill,Summerhill West",15,15,15,15,15,15
"Forest Hill North,Forest Hill West",4,4,4,4,4,4
Lawrence Park,3,3,3,3,3,3
"Moore Park,Summerhill East",1,1,1,1,1,1
North Toronto West,18,18,18,18,18,18
Roselawn,1,1,1,1,1,1
"The Annex,North Midtown,Yorkville",21,21,21,21,21,21


In [98]:
print('There are {} distinct categories.'.format(
    len(venues_central_toronto[COL_NAME_VENUE_CATEGORY].unique())))

There are 58 distinct categories.


# Analyze Each Neighborhood of Central Toronto

In [99]:
# one hot encoding
central_toronto_onehot = pd.get_dummies(venues_central_toronto[[COL_NAME_VENUE_CATEGORY]], 
                                        prefix="", 
                                        prefix_sep="")

# add neighborhood column back to dataframe
central_toronto_onehot[COL_NAME_NEIGHBOURHOOD] = venues_central_toronto[COL_NAME_NEIGHBOURHOOD] 

# move neighborhood column to the first column
fixed_columns = [central_toronto_onehot.columns[-1]] + list(central_toronto_onehot.columns[:-1])
central_toronto_onehot = central_toronto_onehot[fixed_columns]

central_toronto_onehot.head()

Unnamed: 0,Neighbourhood,American Restaurant,BBQ Joint,Bagel Shop,Breakfast Spot,Brewery,Burger Joint,Bus Line,Café,Chinese Restaurant,...,Sports Bar,Supermarket,Sushi Restaurant,Swim School,Thai Restaurant,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Yoga Studio
0,Lawrence Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Lawrence Park,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2,Lawrence Park,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Davisville North,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Davisville North,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [100]:
# Get the shape of the dataframe
central_toronto_onehot.shape

(106, 59)

In [102]:
# group rows by neighborhood, taking the mean of the frequency of occurrence of each category
central_toronto_grouped = central_toronto_onehot.groupby(COL_NAME_NEIGHBOURHOOD).mean().reset_index()
central_toronto_grouped

Unnamed: 0,Neighbourhood,American Restaurant,BBQ Joint,Bagel Shop,Breakfast Spot,Brewery,Burger Joint,Bus Line,Café,Chinese Restaurant,...,Sports Bar,Supermarket,Sushi Restaurant,Swim School,Thai Restaurant,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Yoga Studio
0,Davisville,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.057143,0.0,...,0.0,0.0,0.057143,0.0,0.057143,0.028571,0.0,0.0,0.0,0.0
1,Davisville North,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Deer Park,Forest Hill SE,Rathnelly,South Hill,...",0.066667,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,...,0.066667,0.066667,0.066667,0.0,0.0,0.0,0.0,0.0,0.066667,0.0
3,"Forest Hill North,Forest Hill West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.25,0.0,0.0,0.0,0.25,0.0,0.0,0.0
4,Lawrence Park,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,...,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0
5,"Moore Park,Summerhill East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,North Toronto West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.055556,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556
7,Roselawn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,"The Annex,North Midtown,Yorkville",0.047619,0.047619,0.0,0.0,0.0,0.047619,0.0,0.142857,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.047619,0.0,0.0


In [104]:
# Get the shape of the dataframe
central_toronto_grouped.shape

(9, 59)
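Taking the mean of one-hot columns within a neighborhood gives the share of that neighborhood's venues that fall in each category. The short sketch below reproduces the Lawrence Park row by hand from its three venue categories, without pandas, to make that arithmetic visible.

```python
from collections import Counter

# The three Lawrence Park venue categories listed in the notebook output.
lawrence_park_categories = ["Park", "Swim School", "Bus Line"]

counts = Counter(lawrence_park_categories)
total = len(lawrence_park_categories)

# Share of venues per category: the same number the grouped mean produces
frequencies = {category: count / total for category, count in counts.items()}

for category, freq in sorted(frequencies.items()):
    print("{}: {:.6f}".format(category, freq))
```

Each category appears once out of three venues, which is the 0.333333 shown for Bus Line and Swim School in the grouped table.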

In [105]:
# Print each neighborhood along with the top 5 most common venues
num_top_venues = 5
COL_NAME_FREQUENCE = 'freq'

for hood in central_toronto_grouped[COL_NAME_NEIGHBOURHOOD]:
    print("----"+hood+"----")
    temp = central_toronto_grouped[central_toronto_grouped[COL_NAME_NEIGHBOURHOOD] == hood].T.reset_index()
    temp.columns = [COL_NAME_VENUE, COL_NAME_FREQUENCE]
    temp = temp.iloc[1:]
    temp[COL_NAME_FREQUENCE] = temp[COL_NAME_FREQUENCE].astype(float)
    temp = temp.round({COL_NAME_FREQUENCE: 2})
    print(temp.sort_values(COL_NAME_FREQUENCE, ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Davisville----
                Venue  freq
0      Sandwich Place  0.09
1         Pizza Place  0.09
2        Dessert Shop  0.09
3                 Gym  0.06
4  Italian Restaurant  0.06


----Davisville North----
               Venue  freq
0                Gym  0.12
1              Hotel  0.12
2   Department Store  0.12
3       Dance Studio  0.12
4  Food & Drink Shop  0.12


----Deer Park,Forest Hill SE,Rathnelly,South Hill,Summerhill West----
                 Venue  freq
0                  Pub  0.13
1          Coffee Shop  0.13
2  American Restaurant  0.07
3           Restaurant  0.07
4          Pizza Place  0.07


----Forest Hill North,Forest Hill West----
                 Venue  freq
0        Jewelry Store  0.25
1                Trail  0.25
2   Mexican Restaurant  0.25
3     Sushi Restaurant  0.25
4  American Restaurant  0.00


----Lawrence Park----
                 Venue  freq
0             Bus Line  0.33
1          Swim School  0.33
2                 Park  0.33
3  American Restaur

In [106]:
# Create a function to sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Create the new dataframe and display the top 10 venues for each neighborhood.

In [107]:

import numpy as np

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = [COL_NAME_NEIGHBOURHOOD]
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted[COL_NAME_NEIGHBOURHOOD] = central_toronto_grouped[COL_NAME_NEIGHBOURHOOD]

for ind in np.arange(central_toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(central_toronto_grouped.iloc[ind, :], 
                                                                          num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Davisville,Pizza Place,Dessert Shop,Sandwich Place,Café,Gym,Thai Restaurant,Sushi Restaurant,Coffee Shop,Italian Restaurant,Park
1,Davisville North,Department Store,Food & Drink Shop,Hotel,Gym,Breakfast Spot,Dance Studio,Sandwich Place,Park,Fast Food Restaurant,Garden
2,"Deer Park,Forest Hill SE,Rathnelly,South Hill,...",Pub,Coffee Shop,American Restaurant,Light Rail Station,Pizza Place,Bagel Shop,Vietnamese Restaurant,Fried Chicken Joint,Sushi Restaurant,Supermarket
3,"Forest Hill North,Forest Hill West",Trail,Sushi Restaurant,Jewelry Store,Mexican Restaurant,Yoga Studio,Dessert Shop,Gym,Greek Restaurant,Gourmet Shop,Gas Station
4,Lawrence Park,Swim School,Bus Line,Park,Yoga Studio,Dessert Shop,Gym / Fitness Center,Gym,Greek Restaurant,Gourmet Shop,Gas Station
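The column labels are built from a three-item suffix list with a fallback to "th", which is enough for ten columns but would label an eleventh through thirteenth column correctly only by accident and a twenty-first one incorrectly. As a sketch of a more general alternative, the hypothetical `ordinal` helper below handles any positive integer.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


columns = ["{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)]
print(columns[:4])
```

For the ten columns used here, both approaches produce the same labels.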


Clustering Neighborhoods of Central Toronto, Canada

Run k-means to cluster the neighborhoods into 5 clusters.

In [108]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

central_toronto_grouped_clustering = central_toronto_grouped.drop(COL_NAME_NEIGHBOURHOOD, axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(central_toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 0, 0, 4, 3, 2, 0, 1, 0])
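To read the label array more easily, the sketch below counts how many neighborhoods landed in each cluster and lists their names. The labels are copied from the output above, and the names follow the row order of `central_toronto_grouped`; nothing is recomputed.

```python
from collections import Counter

# Cluster labels copied from the kmeans.labels_ output.
labels = [0, 0, 0, 4, 3, 2, 0, 1, 0]

# Neighbourhoods in the row order of central_toronto_grouped.
neighbourhoods = [
    "Davisville",
    "Davisville North",
    "Deer Park,Forest Hill SE,Rathnelly,South Hill,Summerhill West",
    "Forest Hill North,Forest Hill West",
    "Lawrence Park",
    "Moore Park,Summerhill East",
    "North Toronto West",
    "Roselawn",
    "The Annex,North Midtown,Yorkville",
]

cluster_sizes = Counter(labels)
for cluster in sorted(cluster_sizes):
    members = [name for name, label in zip(neighbourhoods, labels) if label == cluster]
    print(cluster, cluster_sizes[cluster], members)
```

With nine neighborhoods and five clusters, cluster 0 holds five neighborhoods and each of the other four clusters holds exactly one, so a smaller number of clusters may be worth trying.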

In [109]:
df_central_toronto.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
1,M4P,Central Toronto,Davisville North,43.712751,-79.390197
2,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
3,M4S,Central Toronto,Davisville,43.704324,-79.38879
4,M4T,Central Toronto,"Moore Park,Summerhill East",43.689574,-79.38316


Create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [110]:
COL_NAME_CLUSTER_LABELS = 'Cluster Labels'

# add clustering labels
neighborhoods_venues_sorted.insert(0, COL_NAME_CLUSTER_LABELS, kmeans.labels_)

df_central_toronto_merged = df_central_toronto

# join the sorted venues and cluster labels onto df_central_toronto, which holds latitude/longitude for each neighborhood
df_central_toronto_merged = df_central_toronto_merged.join(neighborhoods_venues_sorted.set_index(COL_NAME_NEIGHBOURHOOD), 
                                                           on=COL_NAME_NEIGHBOURHOOD)

df_central_toronto_merged.head() # check the last columns!

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,3,Swim School,Bus Line,Park,Yoga Studio,Dessert Shop,Gym / Fitness Center,Gym,Greek Restaurant,Gourmet Shop,Gas Station
1,M4P,Central Toronto,Davisville North,43.712751,-79.390197,0,Department Store,Food & Drink Shop,Hotel,Gym,Breakfast Spot,Dance Studio,Sandwich Place,Park,Fast Food Restaurant,Garden
2,M4R,Central Toronto,North Toronto West,43.715383,-79.405678,0,Coffee Shop,Clothing Store,Yoga Studio,Spa,Fast Food Restaurant,Diner,Mexican Restaurant,Dessert Shop,Park,Restaurant
3,M4S,Central Toronto,Davisville,43.704324,-79.38879,0,Pizza Place,Dessert Shop,Sandwich Place,Café,Gym,Thai Restaurant,Sushi Restaurant,Coffee Shop,Italian Restaurant,Park
4,M4T,Central Toronto,"Moore Park,Summerhill East",43.689574,-79.38316,2,Restaurant,Yoga Studio,Gym / Fitness Center,Gym,Greek Restaurant,Gourmet Shop,Gas Station,Garden,Fried Chicken Joint,Food & Drink Shop
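The join above keeps every row of the location frame. A neighbourhood with no matching venue row would get NaN in the label and venue columns, and the label column would silently become float. A small sketch of that behaviour and one possible cleanup follows; the frames and values are invented for illustration.

```python
import pandas as pd

# Hypothetical location and venue tables; "C" has no venue data.
locations = pd.DataFrame({
    "Neighbourhood": ["A", "B", "C"],
    "Latitude": [43.70, 43.71, 43.72],
})
venues = pd.DataFrame({
    "Cluster Labels": [1, 0],
    "Neighbourhood": ["A", "B"],
    "1st Most Common Venue": ["Park", "Cafe"],
})

# Same pattern as the notebook: index the right frame by name, join on the column.
merged = locations.join(venues.set_index("Neighbourhood"), on="Neighbourhood")

# Count rows that found no match in the venue table.
unmatched = int(merged["Cluster Labels"].isna().sum())

# Drop unmatched rows and restore integer labels before using them as indices.
clean = merged.dropna(subset=["Cluster Labels"]).copy()
clean["Cluster Labels"] = clean["Cluster Labels"].astype(int)
print(unmatched, clean["Cluster Labels"].tolist())
```

In the table above every neighbourhood has a label, so no cleanup appears to be needed for this borough; the check is mainly useful when the venue query returns nothing for some area.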


Visualize the resulting clusters

In [111]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# Get the geographical coordinates of Central Toronto, Canada
address = 'Central Toronto, CA'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Central Toronto, CA are {}, {}.'.format(latitude, longitude))
# ------------------------------------------------------------------------------------------------
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add one circle marker per neighbourhood, colored by its cluster label
for lat, lon, poi, cluster in zip(df_central_toronto_merged[COL_NAME_LATITUDE],
                                  df_central_toronto_merged[COL_NAME_LONGITUDE],
                                  df_central_toronto_merged[COL_NAME_NEIGHBOURHOOD],
                                  df_central_toronto_merged[COL_NAME_CLUSTER_LABELS]):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],       # labels are 0-based, so index the palette directly
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters

The geographical coordinates of Central Toronto, CA are 43.653963, -79.387207.
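The colour scheme can be checked without drawing a map. The sketch below builds one hex colour per cluster from matplotlib's rainbow colormap and looks colours up by zero-based label. The helper `cluster_color` is a hypothetical name introduced here for illustration.

```python
import numpy as np
from matplotlib import cm, colors

kclusters = 5

# One evenly spaced rainbow colour per cluster, converted to hex strings.
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, kclusters))]

def cluster_color(label):
    """Return the hex colour for a zero-based cluster label (hypothetical helper)."""
    return palette[int(label)]

first = cluster_color(0)
last = cluster_color(kclusters - 1)
print(palette)
```

Casting with `int()` also covers labels that have become floats after a join.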


# Examine clusters

In [112]:
# Cluster 1
df_temp = df_central_toronto_merged.copy()

df_temp.loc[df_temp[COL_NAME_CLUSTER_LABELS] == 0, df_temp.columns[[1] + list(range(5, df_temp.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Central Toronto,0,Department Store,Food & Drink Shop,Hotel,Gym,Breakfast Spot,Dance Studio,Sandwich Place,Park,Fast Food Restaurant,Garden
2,Central Toronto,0,Coffee Shop,Clothing Store,Yoga Studio,Spa,Fast Food Restaurant,Diner,Mexican Restaurant,Dessert Shop,Park,Restaurant
3,Central Toronto,0,Pizza Place,Dessert Shop,Sandwich Place,Café,Gym,Thai Restaurant,Sushi Restaurant,Coffee Shop,Italian Restaurant,Park
5,Central Toronto,0,Pub,Coffee Shop,American Restaurant,Light Rail Station,Pizza Place,Bagel Shop,Vietnamese Restaurant,Fried Chicken Joint,Sushi Restaurant,Supermarket
8,Central Toronto,0,Café,Sandwich Place,Coffee Shop,Indian Restaurant,Pub,BBQ Joint,Burger Joint,Cosmetics Shop,History Museum,Liquor Store


In [113]:
# Cluster 2
df_temp.loc[df_temp[COL_NAME_CLUSTER_LABELS] == 1, df_temp.columns[[1] + list(range(5, df_temp.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Central Toronto,1,Garden,Yoga Studio,Dessert Shop,History Museum,Gym / Fitness Center,Gym,Greek Restaurant,Gourmet Shop,Gas Station,Fried Chicken Joint


In [114]:
# Cluster 3
df_temp.loc[df_temp[COL_NAME_CLUSTER_LABELS] == 2, df_temp.columns[[1] + list(range(5, df_temp.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Central Toronto,2,Restaurant,Yoga Studio,Gym / Fitness Center,Gym,Greek Restaurant,Gourmet Shop,Gas Station,Garden,Fried Chicken Joint,Food & Drink Shop


In [115]:
# Cluster 4
df_temp.loc[df_temp[COL_NAME_CLUSTER_LABELS] == 3, df_temp.columns[[1] + list(range(5, df_temp.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Central Toronto,3,Swim School,Bus Line,Park,Yoga Studio,Dessert Shop,Gym / Fitness Center,Gym,Greek Restaurant,Gourmet Shop,Gas Station


In [116]:
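# Cluster 5 (k-means labels are 0-based, so the fifth cluster has label 4)
df_temp.loc[df_temp[COL_NAME_CLUSTER_LABELS] == 4, df_temp.columns[[1] + list(range(5, df_temp.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
7,Central Toronto,4,Trail,Sushi Restaurant,Jewelry Store,Mexican Restaurant,Yoga Studio,Dessert Shop,Gym,Greek Restaurant,Gourmet Shop,Gas Station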
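Because k-means labels start at 0, "Cluster n" in the cells above corresponds to label n-1. A hedged sketch of looping over every label instead of writing one filter per cell is given below; the small frame stands in for the merged dataframe and its values are made up.

```python
import pandas as pd

# Hypothetical stand-in for the merged frame.
merged = pd.DataFrame({
    "Borough": ["Central Toronto"] * 4,
    "Cluster Labels": [0, 2, 0, 1],
    "1st Most Common Venue": ["Pub", "Park", "Cafe", "Trail"],
})

kclusters = 3
sizes = {}
for label in range(kclusters):
    # Select the rows that belong to this zero-based label.
    members = merged.loc[merged["Cluster Labels"] == label]
    sizes[label] = len(members)
    print(f"Cluster {label + 1} (label {label}): {len(members)} row(s)")
```

Checking that the sizes add up to the number of rows is a quick way to confirm that no cluster was skipped.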
