## Segmenting and Clustering Neighborhoods in Toronto

#### *All three questions are answered in this single .ipynb notebook.

#### 1. Create a dataframe with three columns: PostalCode, Borough, and Neighborhood, following the instructions.

In [1]:
#install lxml to scrape the table
!pip install lxml    
import pandas as pd



In [2]:
#read the webpage, parse its tables into a list of DataFrames, and check the length of the list.
raw=pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
len(raw)

3

In [3]:
#show the table we need, which is the first item in the list
raw[0]

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
...,...,...,...
175,M5Z,Not assigned,
176,M6Z,Not assigned,
177,M7Z,Not assigned,
178,M8Z,Etobicoke,Mimico NW / The Queensway West / South of Bloo...


In [4]:
#raw[0] is already a DataFrame, so take a copy of it and rename the postal code column
df=raw[0].copy()
df.rename(columns={'Postal code': 'PostalCode'}, inplace=True)
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
7,M8A,Not assigned,
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,Malvern / Rouge


In [5]:
#clean up the data: drop rows whose Borough is 'Not assigned', then replace ' / ' with ',' in the Neighborhood column
df=df[df['Borough']!='Not assigned'].copy()
df['Neighborhood']=df['Neighborhood'].str.replace(' / ',',',regex=False)
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park,Harbourfront"
5,M6A,North York,"Lawrence Manor,Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park,Ontario Provincial Government"
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,"Malvern,Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill,Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"
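
Assigning to a column of a filtered frame can trigger pandas' `SettingWithCopyWarning` unless the filtered result is an explicit copy. A minimal sketch of the cleaning step on a toy frame, where `toy` and `clean` are illustrative names and the three rows are only stand-ins for the scraped table:

```python
import pandas as pd

toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighborhood": [None, "Parkwoods", "Regent Park / Harbourfront"],
})

# Take an explicit copy so the later column assignment does not target a view.
clean = toy[toy["Borough"] != "Not assigned"].copy()

# regex=False treats ' / ' as a literal string rather than a pattern.
clean["Neighborhood"] = clean["Neighborhood"].str.replace(" / ", ",", regex=False)
clean = clean.reset_index(drop=True)

print(clean)
```

Passing `regex=False` is a choice made here for clarity; the separator contains no special characters, so either mode should give the same result.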


#check to see if any Borough has Neighborhood 'Not assigned'.
df['Neighborhood'].str.contains('Not').sum()

In [6]:
#reset index
df.reset_index(drop=True,inplace=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park,Harbourfront"
3,M6A,North York,"Lawrence Manor,Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park,Ontario Provincial Government"


In [7]:
#check the volume of the data frame
df.shape

(103, 3)

#### 2. Get the latitude and the longitude coordinates of each neighborhood and add to the dataframe

In [8]:
!pip install geocoder



In [9]:
#find the latitude and longitude for a postal code (using geocoder and ArcGIS)
import geocoder
def get_geocoder(postal_code):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while lat_lng_coords is None:
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    return latitude, longitude
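
The `while` loop above retries until the service answers, so a lookup that keeps failing would never return. Below is a sketch of a bounded alternative, not the notebook's method. `geocode_with_retry`, `max_tries`, `delay`, and the injected `lookup` callable are all hypothetical names introduced for illustration, and `fake_lookup` stands in for the real service so the sketch runs offline:

```python
import time


def geocode_with_retry(lookup, postal_code, max_tries=5, delay=0.0):
    """Call lookup(query) until it returns coordinates or the tries run out.

    `lookup` is a hypothetical callable that returns [lat, lng] or None.
    """
    query = "{}, Toronto, Ontario".format(postal_code)
    for _ in range(max_tries):
        coords = lookup(query)
        if coords is not None:
            return coords[0], coords[1]
        time.sleep(delay)  # pause before the next attempt
    return None, None  # give up instead of looping forever


# Stand-in for the real service: fails twice, then answers.
calls = {"n": 0}


def fake_lookup(query):
    calls["n"] += 1
    return None if calls["n"] < 3 else [43.752935, -79.335641]


result = geocode_with_retry(fake_lookup, "M3A")
print(result)
```

With the real library, the callable could be `lambda q: geocoder.arcgis(q).latlng`, which is the same call the cell above makes.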

In [12]:
#write into a list (the most efficient way I have tried), then convert into a proper dataframe
latlng_list=[]
for i in df['PostalCode'].values:
    m=list(get_geocoder(i))
    m.append(i)
    latlng_list.append(m)
df1=pd.DataFrame(latlng_list,columns=['latitude','longitude','PostalCode'])
df1.head()

Unnamed: 0,latitude,longitude,PostalCode
0,43.752935,-79.335641,M3A
1,43.728102,-79.31189,M4A
2,43.650964,-79.353041,M5A
3,43.723265,-79.451211,M6A
4,43.66179,-79.38939,M7A


In [13]:
#Merge the dataframe together.
df_final=pd.merge(df,df1,on='PostalCode')

In [14]:
df_final.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,latitude,longitude
0,M3A,North York,Parkwoods,43.752935,-79.335641
1,M4A,North York,Victoria Village,43.728102,-79.31189
2,M5A,Downtown Toronto,"Regent Park,Harbourfront",43.650964,-79.353041
3,M6A,North York,"Lawrence Manor,Lawrence Heights",43.723265,-79.451211
4,M7A,Downtown Toronto,"Queen's Park,Ontario Provincial Government",43.66179,-79.38939
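
Because each postal code should appear exactly once on both sides, the merge can be asked to check that. A small sketch on two made-up frames (`left` and `right` are illustrative names), assuming a left join with pandas' `validate` option is acceptable here:

```python
import pandas as pd

left = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
})
right = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "latitude": [43.752935, 43.728102],
    "longitude": [-79.335641, -79.31189],
})

# validate='one_to_one' raises MergeError if a key is duplicated on either side.
merged = pd.merge(left, right, on="PostalCode", how="left", validate="one_to_one")

# With a left join, postal codes that failed to geocode would show up as NaN.
missing = int(merged["latitude"].isna().sum())
print(merged.shape, missing)
```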


#### 3. Explore and cluster the neighborhoods in Toronto.

In [None]:
#import libraries that are needed
import numpy as np 
import json 

!pip install geopy
from geopy.geocoders import Nominatim

import requests
from pandas import json_normalize

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes 
import folium 
print('Libraries imported.')

In [17]:
#find the geographic coordinates of the City of Toronto
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="T_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of the City of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of the City of Toronto are 43.6534817, -79.3839347.


Now we can filter the data further to keep only the boroughs whose names contain "Toronto" and store them in a new dataframe, toronto_center.

In [23]:
toronto_center=df_final[df_final['Borough'].str.contains("Toronto")].reset_index(drop=True)
toronto_center.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,latitude,longitude
0,M5A,Downtown Toronto,"Regent Park,Harbourfront",43.650964,-79.353041
1,M7A,Downtown Toronto,"Queen's Park,Ontario Provincial Government",43.66179,-79.38939
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657491,-79.377529
3,M5C,Downtown Toronto,St. James Town,43.651734,-79.375554
4,M4E,East Toronto,The Beaches,43.678148,-79.295349


Create a map to visualize these neighborhoods.

In [27]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)
# add markers to map from the previous generated data frame
for lat, lng, borough, neighborhood in zip(toronto_center['latitude'], toronto_center['longitude'], toronto_center['Borough'], toronto_center['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='Orange',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Now use the Foursquare API. First, define the credentials and the API version as variables.

In [28]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # replace with your own Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # replace with your own Foursquare secret
VERSION = '20180605'

Create a function to explore the venues around each neighborhood in the toronto_center dataframe.

In [40]:
def getNearbyVenues(names, latitudes, longitudes):
    radius=500
    LIMIT=200
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
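
The list comprehension above indexes `categories[0]` directly, which would raise an `IndexError` if a venue ever came back with an empty category list. That such responses occur is an assumption, not something the source confirms. Here is a sketch of a more defensive parser for the same nested structure; `extract_venues` is a hypothetical helper and `sample` is hand-written, not real API output:

```python
def extract_venues(payload):
    """Flatten an explore-style payload into (name, lat, lng, category) tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        # Assumed edge case: no categories listed; record None instead of failing.
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((venue["name"],
                     venue["location"]["lat"],
                     venue["location"]["lng"],
                     category))
    return rows


# Hand-written stand-in for a response; not real API output.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Venue A", "location": {"lat": 43.65, "lng": -79.35},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Venue B", "location": {"lat": 43.66, "lng": -79.36},
               "categories": []}},
]}]}}

rows = extract_venues(sample)
empty = extract_venues({"response": {"groups": []}})
print(rows)
print(empty)
```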

Run this function with the dataframe toronto_center

In [41]:
toronto_venues = getNearbyVenues(names=toronto_center['Neighborhood'],latitudes=toronto_center['latitude'],longitudes=toronto_center['longitude'])

Regent Park,Harbourfront
Queen's Park,Ontario Provincial Government
Garden District, Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond,Adelaide,King
Dufferin,Dovercourt Village
Harbourfront East,Union Station,Toronto Islands
Little Portugal,Trinity
The Danforth West,Riverdale
Toronto Dominion Centre,Design Exchange
Brockton,Parkdale Village,Exhibition Place
India Bazaar,The Beaches West
Commerce Court,Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West
High Park,The Junction South
North Toronto West
The Annex,North Midtown,Yorkville
Parkdale,Roncesvalles
Davisville
University of Toronto,Harbord
Runnymede,Swansea
Moore Park,Summerhill East
Kensington Market,Chinatown,Grange Park
Summerhill West,Rathnelly,South Hill,Forest Hill SE,Deer Park
CN Tower,King and Spadina,Railway Lands,Harbourfront West,Bathurst  Quay,South Niagara,Island airport
Rosedale
Stn A PO Boxes
St. James Town,Cabbagetown
First Canadian Place

In [42]:
print(toronto_venues.shape)
toronto_venues.head()

(1675, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park,Harbourfront",43.650964,-79.353041,Souk Tabule,43.653756,-79.35439,Mediterranean Restaurant
1,"Regent Park,Harbourfront",43.650964,-79.353041,Young Centre for the Performing Arts,43.650825,-79.357593,Performing Arts Venue
2,"Regent Park,Harbourfront",43.650964,-79.353041,SOMA chocolatemaker,43.650622,-79.358127,Chocolate Shop
3,"Regent Park,Harbourfront",43.650964,-79.353041,Cluny Bistro & Boulangerie,43.650565,-79.357843,French Restaurant
4,"Regent Park,Harbourfront",43.650964,-79.353041,BATLgrounds,43.647088,-79.351306,Athletics & Sports


Now let's take a look at how many venues were returned for each neighborhood.

In [43]:
toronto_venues[['Neighborhood','Venue']].groupby('Neighborhood').count()

Neighborhood,Venue
Berczy Park,66
"Brockton,Parkdale Village,Exhibition Place",45
Business reply mail Processing CentrE,100
"CN Tower,King and Spadina,Railway Lands,Harbourfront West,Bathurst Quay,South Niagara,Island airport",67
Central Bay Street,79
Christie,11
Church and Wellesley,86
"Commerce Court,Victoria Hotel",100
Davisville,28
Davisville North,4


In [45]:
#how many unique categories do we have here for the venues?
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 226 unique categories.


Analyze each neighborhood with one-hot encoding.

In [62]:
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

In [63]:
toronto_onehot.shape

(1675, 226)

In [64]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,American Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Workshop,...,Trail,Train Station,Transportation Service,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.015152,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.015152,0.0,0.0,0.0,0.0,0.0,0.0
1,"Brockton,Parkdale Village,Exhibition Place",0.022222,0.0,0.0,0.022222,0.0,0.022222,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Business reply mail Processing CentrE,0.0,0.0,0.03,0.01,0.0,0.01,0.03,0.0,0.0,...,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.0
3,"CN Tower,King and Spadina,Railway Lands,Harbou...",0.0,0.0,0.0,0.0,0.0,0.0,0.014925,0.0,0.0,...,0.014925,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.012658,0.0,0.012658,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.012658,0.012658,0.012658,0.0,0.0,0.0
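
To make the one-hot and group-by steps concrete, here is a minimal sketch on four made-up venues. The frame `venues` and its values are illustrative, and `dtype=float` is an assumption added so the averaging behaves the same across pandas versions:

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Park", "Park"],
})

# One column per category, 1.0 where the venue belongs to that category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of the 0/1 indicators is the share of each category per neighborhood.
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```

Neighborhood A has two coffee shops out of three venues, so its `Coffee Shop` share is 2/3; each row of `grouped` sums to 1 across the category columns.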


Let's print each neighborhood along with the top 5 most common venues

In [68]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp =  toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berczy Park----
                venue  freq
0         Coffee Shop  0.12
1          Restaurant  0.05
2        Cocktail Bar  0.05
3  Seafood Restaurant  0.05
4         Cheese Shop  0.03


----Brockton,Parkdale Village,Exhibition Place----
                    venue  freq
0             Coffee Shop  0.09
1                    Café  0.07
2              Restaurant  0.04
3  Thrift / Vintage Store  0.04
4               Gift Shop  0.04


----Business reply mail Processing CentrE----
              venue  freq
0       Coffee Shop  0.09
1        Restaurant  0.04
2               Bar  0.04
3             Hotel  0.04
4  Asian Restaurant  0.03


----CN Tower,King and Spadina,Railway Lands,Harbourfront West,Bathurst  Quay,South Niagara,Island airport----
               venue  freq
0        Coffee Shop  0.07
1         Restaurant  0.06
2               Café  0.06
3  French Restaurant  0.04
4               Park  0.04


----Central Bay Street----
                       venue  freq
0                Coffee S

Next, write the top 10 most common venues for each neighborhood into a dataframe.

In [71]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
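
As a quick illustration of what the function does to one row, the sketch below applies the same steps to a toy `Series`. The three frequencies are illustrative values, and `row`, `freqs`, and `top_two` are names introduced here:

```python
import pandas as pd

row = pd.Series({
    "Neighborhood": "Berczy Park",
    "Cheese Shop": 0.03,
    "Coffee Shop": 0.12,
    "Restaurant": 0.05,
})

# Skip the first entry (the neighborhood name) and keep the frequencies.
freqs = row.iloc[1:].astype(float)

# Sort descending and keep the labels of the two largest values.
top_two = freqs.sort_values(ascending=False).index[:2].tolist()
print(top_two)
```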

In [74]:
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Cocktail Bar,Restaurant,Seafood Restaurant,Farmers Market,Breakfast Spot,Cheese Shop,Café,Beer Bar,Hotel
1,"Brockton,Parkdale Village,Exhibition Place",Coffee Shop,Café,Restaurant,Gift Shop,Pizza Place,Thrift / Vintage Store,French Restaurant,Pet Store,Mexican Restaurant,Boutique
2,Business reply mail Processing CentrE,Coffee Shop,Bar,Restaurant,Hotel,American Restaurant,Asian Restaurant,Italian Restaurant,Gym,Taco Place,Japanese Restaurant
3,"CN Tower,King and Spadina,Railway Lands,Harbou...",Coffee Shop,Café,Restaurant,French Restaurant,Park,Bar,Gym / Fitness Center,Italian Restaurant,Lounge,Speakeasy
4,Central Bay Street,Coffee Shop,Clothing Store,Middle Eastern Restaurant,Japanese Restaurant,Breakfast Spot,Bubble Tea Shop,Sandwich Place,Fast Food Restaurant,Italian Restaurant,Restaurant
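
The `try`/`except` above labels positions 1 to 3 and falls back to "th", which is enough for ten columns but would produce "21th" for larger values. A hypothetical helper, `ordinal`, sketches a more general suffix rule; it is not part of the original notebook:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11, 12, and 13 (and 111, 112, ...) take 'th' despite ending in 1, 2, 3.
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


columns = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
print(columns)
```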


Cluster neighborhoods: let's run k-means to cluster the neighborhoods into 4 clusters.

In [86]:
# set number of clusters
kclusters = 4

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:25] 

array([0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 3, 0, 0, 2, 0, 0, 2, 0, 0, 0,
       0, 0, 0])
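
To show what `kmeans.labels_` holds, here is a small sketch on four made-up frequency rows. The array `X` is illustrative, and `n_init=10` is set explicitly as an assumption so the behavior does not depend on the scikit-learn version's default:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two rows dominated by the first category, two by the second.
X = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# labels_ has one integer per input row, in the same order as the rows of X.
labels = model.labels_
print(labels)
```

The label numbers themselves are arbitrary; what matters is which rows share a label.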

In [91]:
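# drop any 'Cluster Labels' column left over from a previous run so the insert below can be re-run
neighborhoods_venues_sorted=neighborhoods_venues_sorted.drop(columns='Cluster Labels',errors='ignore')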

In [92]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
toronto_merge = toronto_center

# join the sorted venues and cluster labels onto toronto_center, which holds the latitude/longitude for each neighborhood
toronto_merge = toronto_merge.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

In [93]:
#check if the last column contains the cluster labels
toronto_merge.columns

Index(['PostalCode', 'Borough', 'Neighborhood', 'latitude', 'longitude',
       'Cluster Labels', '1st Most Common Venue', '2nd Most Common Venue',
       '3rd Most Common Venue', '4th Most Common Venue',
       '5th Most Common Venue', '6th Most Common Venue',
       '7th Most Common Venue', '8th Most Common Venue',
       '9th Most Common Venue', '10th Most Common Venue'],
      dtype='object')

In [95]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merge['latitude'], toronto_merge['longitude'], toronto_merge['Neighborhood'], toronto_merge['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Upon observation, cluster 1 (cluster label 0) represents the more downtown-core type of neighborhood.

In [97]:
toronto_merge.loc[toronto_merge['Cluster Labels'] == 0, toronto_merge.columns[[1] + list(range(5, toronto_merge.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown Toronto,0,Coffee Shop,Pub,Café,Athletics & Sports,Music Venue,Theater,Seafood Restaurant,Mexican Restaurant,Food Truck,French Restaurant
1,Downtown Toronto,0,Coffee Shop,Café,Yoga Studio,Diner,Park,Middle Eastern Restaurant,Juice Bar,Italian Restaurant,Fried Chicken Joint,Distribution Center
2,Downtown Toronto,0,Coffee Shop,Clothing Store,Sandwich Place,Middle Eastern Restaurant,Hotel,Cosmetics Shop,Café,Restaurant,Theater,Movie Theater
3,Downtown Toronto,0,Coffee Shop,Café,Restaurant,Italian Restaurant,Cocktail Bar,American Restaurant,Gastropub,Diner,Clothing Store,Bakery
4,East Toronto,0,Health Food Store,Pub,Trail,Church,Cupcake Shop,Dumpling Restaurant,Fish Market,Fish & Chips Shop,Fast Food Restaurant,Farmers Market
5,Downtown Toronto,0,Coffee Shop,Cocktail Bar,Restaurant,Seafood Restaurant,Farmers Market,Breakfast Spot,Cheese Shop,Café,Beer Bar,Hotel
6,Downtown Toronto,0,Coffee Shop,Clothing Store,Middle Eastern Restaurant,Japanese Restaurant,Breakfast Spot,Bubble Tea Shop,Sandwich Place,Fast Food Restaurant,Italian Restaurant,Restaurant
8,Downtown Toronto,0,Coffee Shop,Café,Restaurant,Gym,Thai Restaurant,Bakery,Concert Hall,Steakhouse,Cosmetics Shop,Sushi Restaurant
10,Downtown Toronto,0,Harbor / Marina,Theme Park,Fast Food Restaurant,Park,Farm,Donut Shop,Fish Market,Fish & Chips Shop,Farmers Market,Falafel Restaurant
11,West Toronto,0,Coffee Shop,Bar,Restaurant,Cocktail Bar,Wine Bar,Vietnamese Restaurant,Pizza Place,Asian Restaurant,Yoga Studio,New American Restaurant
