<h3>Segmenting and Clustering Neighborhoods</h3>

Import the packages needed to load the Wikipedia table.

<h3>Part I: create dataframe from wiki</h3>

In [15]:
import pandas as pd
import numpy as np

In [16]:
raw_wiki = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")

In [17]:
wiki = pd.DataFrame(raw_wiki[0])

Remove any rows that have no Borough assigned, and rename the Postal Code column to "PostalCode".

In [18]:
wiki.drop(wiki[wiki['Borough'] == 'Not assigned'].index, inplace=True)
wiki.rename(columns = {'Postal Code': 'PostalCode'}, inplace=True)

In [19]:
wiki.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
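
The same cleanup can also be written without `inplace=True`, as a single chained expression. The sketch below runs on a small made-up table (the rows are illustrative, not taken from the Wikipedia page) and assumes the same column names as above; `toy` and `cleaned` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the scraped table; these rows are illustrative only.
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep only rows with an assigned borough, then rename the column.
cleaned = (
    toy[toy['Borough'] != 'Not assigned']
    .rename(columns={'Postal Code': 'PostalCode'})
    .reset_index(drop=True)
)
print(cleaned)
```

Filtering with a boolean mask returns a new dataframe, so the original table stays available if a later step needs it.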


To merge rows that share a postal code and combine their neighborhoods into a single comma-separated list, we use the groupby method. After grouping by postal code, we reset the index.

In [20]:
df = wiki.groupby('PostalCode').agg({'Borough':'first','Neighborhood':', '.join}).reset_index()
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
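
To see what the aggregation does when a postal code really is split across rows, here is a minimal sketch on made-up data. The rows are illustrative, and the `', '` separator is an assumption about how the combined names should read.

```python
import pandas as pd

# Illustrative rows: postal code M5A appears twice.
toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M1G'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'Scarborough'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Woburn'],
})

# Keep the first borough per postal code and join the neighborhood names.
merged = (
    toy.groupby('PostalCode')
    .agg({'Borough': 'first', 'Neighborhood': ', '.join})
    .reset_index()
)
print(merged)
```

With an empty-string separator the two names would run together as one word, so the separator matters whenever a group has more than one row.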


If any Neighborhood has no name, we replace its value with the Borough name.

In [21]:
mask = df['Neighborhood'] == 'Not assigned'
df.loc[mask, 'Neighborhood'] = df.loc[mask, 'Borough']

In [22]:
df.shape

(103, 3)

<h3>Part II: find location coordinates for each neighborhood</h3>

In [23]:

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, available since pandas 1.0)

Read the location coordinate data from a CSV file and store it in a dataframe.

In [24]:
location_df = pd.read_csv("http://cocl.us/Geospatial_data")
location_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Loop through each postal code, look up the corresponding coordinates in the location dataframe, then add the latitude and longitude to the original dataframe as separate columns.

In [25]:
Latitude = []
Longitude = []

for postal_code in df['PostalCode']:
    latitude = location_df.loc[location_df['Postal Code'] == postal_code, 'Latitude'].values[0]
    longitude = location_df.loc[location_df['Postal Code'] == postal_code, 'Longitude'].values[0]
    Latitude.append(latitude)
    Longitude.append(longitude)
    
df['Latitude'] = Latitude
df['Longitude'] = Longitude

In [27]:
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
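
The lookup loop above can also be expressed as a join. The following is a sketch on made-up frames that reuse the column names from this notebook; `toy_df`, `toy_locations` and `with_coords` are hypothetical names.

```python
import pandas as pd

# Illustrative frames with the same column names as above.
toy_df = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
})
toy_locations = pd.DataFrame({
    'Postal Code': ['M1C', 'M1B'],
    'Latitude': [43.784535, 43.806686],
    'Longitude': [-79.160497, -79.194353],
})

# A left join keeps every row of toy_df and attaches matching coordinates.
with_coords = (
    toy_df.merge(toy_locations, how='left',
                 left_on='PostalCode', right_on='Postal Code')
    .drop(columns='Postal Code')
)
print(with_coords)
```

One difference in behavior: a postal code missing from the location table ends up with `NaN` coordinates here, whereas the loop would raise an `IndexError` on `.values[0]`.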


<h3>Part III: Explore and Cluster neighborhoods in Scarborough</h3>

Get the Scarborough neighborhood data.

In [28]:
scarborough = df.loc[df['Borough'] == 'Scarborough']

In [29]:
scarborough.shape

(17, 5)

#### Define Foursquare Credentials and Version

In [30]:
CLIENT_ID = 'XWOIXYORH4AD3I0TCCRPITM25V04X3CROI1Q2TJQMBCXYS5G' # your Foursquare ID
CLIENT_SECRET = 'VZ2RYOIVJMQZI3RHSCZG4CWPZNPVXUGI2MGZFF1HCNJG3JGB' # your Foursquare Secret
VERSION = '20200624' # Foursquare API version
LIMIT = 100
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: XWOIXYORH4AD3I0TCCRPITM25V04X3CROI1Q2TJQMBCXYS5G
CLIENT_SECRET:VZ2RYOIVJMQZI3RHSCZG4CWPZNPVXUGI2MGZFF1HCNJG3JGB
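
Credentials do not have to be written into the notebook. One possible approach, sketched below, reads them from environment variables and prints only a masked form. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are hypothetical, and the values in `example_env` are placeholders.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables.

    The variable names are hypothetical; pick whatever fits your setup.
    """
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Foursquare credentials are not set in the environment')
    return client_id, client_secret

def mask(secret, visible=4):
    """Show only the first few characters of a secret when printing."""
    return secret[:visible] + '*' * (len(secret) - visible)

# Placeholder values standing in for a real environment.
example_env = {
    'FOURSQUARE_CLIENT_ID': 'abcd1234',
    'FOURSQUARE_CLIENT_SECRET': 'wxyz5678',
}
client_id, client_secret = load_foursquare_credentials(example_env)
print('CLIENT_ID: ' + mask(client_id))
print('CLIENT_SECRET: ' + mask(client_secret))
```

In a real session the function would be called with no argument so that it reads `os.environ`.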


Create a function that queries the Foursquare explore endpoint for venues near each neighborhood in Scarborough and collects the results in one dataframe.


In [31]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):         
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
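
The parsing step in the function above assumes every response has at least one group and every venue has at least one category. A more defensive version of just that step might look like the sketch below. `extract_venues` is a hypothetical helper, and `sample` is a hand-written payload that only mimics the nesting the function indexes into; it is not real API output.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore-style response into rows of venue data.

    Assumes the nesting the function above indexes into:
    payload['response']['groups'][0]['items'][i]['venue'].
    """
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to None when a venue has no category listed.
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     category))
    return rows

# Hand-written payload for illustration; not real API output.
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Cafe A', 'location': {'lat': 43.77, 'lng': -79.21},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Spot B', 'location': {'lat': 43.78, 'lng': -79.22},
               'categories': []}},
]}]}}

rows = extract_venues(sample, 'Woburn', 43.770992, -79.216917)
print(rows)
print(extract_venues({}, 'Woburn', 43.770992, -79.216917))  # empty payload gives []
```

Separating the parsing from the HTTP request also makes it possible to check this logic without network access or credentials.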

#### Now write the code to run the above function on each neighborhood and create a new dataframe called *scarborough_venues*.

In [32]:
scarborough_venues = getNearbyVenues(names=scarborough['Neighborhood'],
                                   latitudes=scarborough['Latitude'],
                                   longitudes=scarborough['Longitude']
                                  )
scarborough_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Malvern, Rouge",43.806686,-79.194353,Wendy’s,43.807448,-79.199056,Fast Food Restaurant
1,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,"Guildwood, Morningside, West Hill",43.763573,-79.188711,RBC Royal Bank,43.76679,-79.191151,Bank
3,"Guildwood, Morningside, West Hill",43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store
4,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Big Bite Burrito,43.766299,-79.19072,Mexican Restaurant


Analyze each neighborhood in Scarborough by one-hot encoding the venue category, so that each category becomes its own 0/1 column.

In [33]:
# one hot encoding
scarborough_onehot = pd.get_dummies(scarborough_venues[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
scarborough_onehot['Neighborhood'] = scarborough_venues['Neighborhood'] 
# move neighborhood column to the first column
fixed_columns = [scarborough_onehot.columns[-1]] + list(scarborough_onehot.columns[:-1])
scarborough_onehot = scarborough_onehot[fixed_columns]

scarborough_onehot.head()

Unnamed: 0,Neighborhood,American Restaurant,Athletics & Sports,Auto Garage,Bakery,Bank,Bar,Breakfast Spot,Bus Line,Bus Station,...,Playground,Rental Car Location,Sandwich Place,Shopping Mall,Skating Rink,Soccer Field,Supermarket,Thai Restaurant,Train Station,Vietnamese Restaurant
0,"Malvern, Rouge",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Rouge Hill, Port Union, Highland Creek",0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Guildwood, Morningside, West Hill",0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Guildwood, Morningside, West Hill",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Guildwood, Morningside, West Hill",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Next, let's group rows by neighborhood and take the mean of the frequency of occurrence of each category

In [34]:
scarborough_mean = scarborough_onehot.groupby('Neighborhood').mean().reset_index()
scarborough_mean.head()

Unnamed: 0,Neighborhood,American Restaurant,Athletics & Sports,Auto Garage,Bakery,Bank,Bar,Breakfast Spot,Bus Line,Bus Station,...,Playground,Rental Car Location,Sandwich Place,Shopping Mall,Skating Rink,Soccer Field,Supermarket,Thai Restaurant,Train Station,Vietnamese Restaurant
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,...,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0
1,"Birch Cliff, Cliffside West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0
2,Cedarbrae,0.0,0.125,0.0,0.125,0.125,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0
3,"Clarks Corners, Tam O'Shanter, Sullivan",0.0,0.0,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.066667,0.0,0.0,0.0,0.066667,0.0,0.0
4,"Cliffside, Cliffcrest, Scarborough Village West",0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
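
To make the two steps concrete, the sketch below one-hot encodes a tiny made-up venue list and then averages per neighborhood. The neighborhoods `A` and `B` and their venues are illustrative only.

```python
import pandas as pd

# Illustrative venues; names and categories are made up.
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B'],
    'Venue Category': ['Bank', 'Bank', 'Bar', 'Park', 'Bar'],
})

# One indicator column per category, stored as 0/1 integers.
onehot = pd.get_dummies(toy_venues['Venue Category'], dtype=int)
onehot.insert(0, 'Neighborhood', toy_venues['Neighborhood'])

# The mean of a 0/1 column is the share of venues in that category.
freq = onehot.groupby('Neighborhood').mean().reset_index()
print(freq)
```

Neighborhood `A` has four venues, two of which are banks, so its `Bank` value is 0.5. Each row of the result sums to 1, which is why these values can be read as category frequencies.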


Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [35]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

In [58]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # no 'st'/'nd'/'rd' entry past the third position
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
scarborough_top_venues = pd.DataFrame(columns=columns)
scarborough_top_venues['Neighborhood'] = scarborough_mean['Neighborhood']

for ind in np.arange(scarborough_mean.shape[0]):
    scarborough_top_venues.iloc[ind, 1:] = return_most_common_venues(scarborough_mean.iloc[ind, :], num_top_venues)

scarborough_top_venues

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Skating Rink,Breakfast Spot,Latin American Restaurant,Lounge,Clothing Store,Vietnamese Restaurant,Convenience Store,Gaming Cafe,Furniture / Home Store,Fried Chicken Joint
1,"Birch Cliff, Cliffside West",General Entertainment,College Stadium,Skating Rink,Farm,Café,Gas Station,Gaming Cafe,Furniture / Home Store,Fried Chicken Joint,Fast Food Restaurant
2,Cedarbrae,Hakka Restaurant,Caribbean Restaurant,Thai Restaurant,Athletics & Sports,Gas Station,Bakery,Bank,Fried Chicken Joint,Department Store,Discount Store
3,"Clarks Corners, Tam O'Shanter, Sullivan",Pizza Place,Pharmacy,Gas Station,Chinese Restaurant,Noodle House,Italian Restaurant,Intersection,Convenience Store,Bank,Shopping Mall
4,"Cliffside, Cliffcrest, Scarborough Village West",American Restaurant,Motel,Gas Station,Gaming Cafe,Furniture / Home Store,Fried Chicken Joint,Fast Food Restaurant,Farm,Electronics Store,Discount Store
5,"Dorset Park, Wexford Heights, Scarborough Town...",Indian Restaurant,Vietnamese Restaurant,Gaming Cafe,Light Rail Station,Pet Store,Chinese Restaurant,College Stadium,Furniture / Home Store,Fried Chicken Joint,Fast Food Restaurant
6,"Golden Mile, Clairlea, Oakridge",Bakery,Bus Line,Soccer Field,Ice Cream Shop,Intersection,Metro Station,Bus Station,Park,Electronics Store,Cosmetics Shop
7,"Guildwood, Morningside, West Hill",Mexican Restaurant,Bank,Intersection,Breakfast Spot,Rental Car Location,Electronics Store,Medical Center,Vietnamese Restaurant,Furniture / Home Store,Fried Chicken Joint
8,"Kennedy Park, Ionview, East Birchmount Park",Coffee Shop,Bus Station,Department Store,Chinese Restaurant,Train Station,Bank,Bar,General Entertainment,Gas Station,Gaming Cafe
9,"Malvern, Rouge",Fast Food Restaurant,Vietnamese Restaurant,Grocery Store,Gas Station,Gaming Cafe,Furniture / Home Store,Fried Chicken Joint,Farm,Electronics Store,Discount Store
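
The column labels above rely on an exception to switch from 'st', 'nd', 'rd' to 'th'. An alternative is a small helper that computes the suffix directly, which also handles values such as 11 or 21 should `num_top_venues` ever grow. This is a sketch; `ordinal` is a hypothetical helper, not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11, 12 and 13 take 'th' even though they end in 1, 2 and 3.
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

num_top_venues = 10
columns = ['Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(i))
    for i in range(1, num_top_venues + 1)
]
print(columns)
```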


Run *k*-means to cluster the neighborhoods into 5 clusters.

In [49]:
!pip install folium



In [50]:

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

import folium # map rendering library

In [59]:
# import k-means from clustering stage
from sklearn.cluster import KMeans
# set number of clusters
kclusters = 5

scarborough_cluster = scarborough_mean.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(scarborough_cluster)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([1, 1, 1, 1, 3, 1, 1, 1, 1, 4, 0, 2, 0, 1, 1, 1], dtype=int32)
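
Each entry in `labels_` is the cluster assigned to the row at the same position in the input. The sketch below shows this on a made-up frequency matrix with two obvious groups; the numbers are illustrative and unrelated to the Scarborough data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative frequency rows: two neighborhoods dominated by category 0,
# two dominated by category 2.
X = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
])

# n_init is set explicitly because its default differs between
# scikit-learn versions.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = model.labels_
print(labels)
```

The first two rows share one label and the last two share the other. Which group is called 0 and which is called 1 is arbitrary, so cluster numbers carry no meaning beyond group membership.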

Create a new dataframe that includes the cluster label as well as the top 10 venues for each neighborhood.

In [60]:
# add clustering labels
scarborough_top_venues.insert(0, 'Cluster Labels', kmeans.labels_)
scarborough_top_venues.head()

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,1,Agincourt,Skating Rink,Breakfast Spot,Latin American Restaurant,Lounge,Clothing Store,Vietnamese Restaurant,Convenience Store,Gaming Cafe,Furniture / Home Store,Fried Chicken Joint
1,1,"Birch Cliff, Cliffside West",General Entertainment,College Stadium,Skating Rink,Farm,Café,Gas Station,Gaming Cafe,Furniture / Home Store,Fried Chicken Joint,Fast Food Restaurant
2,1,Cedarbrae,Hakka Restaurant,Caribbean Restaurant,Thai Restaurant,Athletics & Sports,Gas Station,Bakery,Bank,Fried Chicken Joint,Department Store,Discount Store
3,1,"Clarks Corners, Tam O'Shanter, Sullivan",Pizza Place,Pharmacy,Gas Station,Chinese Restaurant,Noodle House,Italian Restaurant,Intersection,Convenience Store,Bank,Shopping Mall
4,3,"Cliffside, Cliffcrest, Scarborough Village West",American Restaurant,Motel,Gas Station,Gaming Cafe,Furniture / Home Store,Fried Chicken Joint,Fast Food Restaurant,Farm,Electronics Store,Discount Store


In [67]:
scarborough_top_venues.tail()

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
11,2,"Rouge Hill, Port Union, Highland Creek",Bar,Vietnamese Restaurant,College Stadium,Gas Station,Gaming Cafe,Furniture / Home Store,Fried Chicken Joint,Fast Food Restaurant,Farm,Electronics Store
12,0,Scarborough Village,Playground,Convenience Store,Grocery Store,Vietnamese Restaurant,College Stadium,Gas Station,Gaming Cafe,Furniture / Home Store,Fried Chicken Joint,Fast Food Restaurant
13,1,"Steeles West, L'Amoreaux West",Chinese Restaurant,Breakfast Spot,Fast Food Restaurant,Cosmetics Shop,Discount Store,Pharmacy,Pizza Place,Coffee Shop,Sandwich Place,Bank
14,1,"Wexford, Maryvale",Middle Eastern Restaurant,Bakery,Sandwich Place,Breakfast Spot,Auto Garage,Furniture / Home Store,Fried Chicken Joint,Gaming Cafe,College Stadium,Fast Food Restaurant
15,1,Woburn,Coffee Shop,Korean Restaurant,College Stadium,Gas Station,Gaming Cafe,Furniture / Home Store,Fried Chicken Joint,Fast Food Restaurant,Farm,Electronics Store


In [61]:
# add latitude/longitude for each neighborhood
scarborough_merged = scarborough.join(scarborough_top_venues.set_index('Neighborhood'), on='Neighborhood')
scarborough_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,4.0,Fast Food Restaurant,Vietnamese Restaurant,Grocery Store,Gas Station,Gaming Cafe,Furniture / Home Store,Fried Chicken Joint,Farm,Electronics Store,Discount Store
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497,2.0,Bar,Vietnamese Restaurant,College Stadium,Gas Station,Gaming Cafe,Furniture / Home Store,Fried Chicken Joint,Fast Food Restaurant,Farm,Electronics Store
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,1.0,Mexican Restaurant,Bank,Intersection,Breakfast Spot,Rental Car Location,Electronics Store,Medical Center,Vietnamese Restaurant,Furniture / Home Store,Fried Chicken Joint
3,M1G,Scarborough,Woburn,43.770992,-79.216917,1.0,Coffee Shop,Korean Restaurant,College Stadium,Gas Station,Gaming Cafe,Furniture / Home Store,Fried Chicken Joint,Fast Food Restaurant,Farm,Electronics Store
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,1.0,Hakka Restaurant,Caribbean Restaurant,Thai Restaurant,Athletics & Sports,Gas Station,Bakery,Bank,Fried Chicken Joint,Department Store,Discount Store
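
The cluster labels appear as 4.0, 2.0 and so on because the join leaves `NaN` for any neighborhood that has no row in the top-venues table, and a column holding `NaN` is stored as floats. The sketch below reproduces this on made-up data and shows one way to get integers back; `places`, `labels`, `joined` and `labelled` are hypothetical names.

```python
import pandas as pd

# Illustrative data: neighborhood 'C' has no venues, so it has no label.
places = pd.DataFrame({'Neighborhood': ['A', 'B', 'C']})
labels = pd.DataFrame({'Neighborhood': ['A', 'B'], 'Cluster Labels': [1, 0]})

joined = places.join(labels.set_index('Neighborhood'), on='Neighborhood')
print(joined['Cluster Labels'].tolist())

# Drop unlabelled rows, then restore the integer dtype.
labelled = joined.dropna(subset=['Cluster Labels']).copy()
labelled['Cluster Labels'] = labelled['Cluster Labels'].astype(int)
print(labelled)
```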


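Drop the neighborhoods that received no cluster label (typically those for which Foursquare returned no venues), then visualize the resulting clusters.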

In [76]:
scarborough_merged = scarborough_merged.dropna()
scarborough_merged.tail()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
11,M1R,Scarborough,"Wexford, Maryvale",43.750072,-79.295849,1.0,Middle Eastern Restaurant,Bakery,Sandwich Place,Breakfast Spot,Auto Garage,Furniture / Home Store,Fried Chicken Joint,Gaming Cafe,College Stadium,Fast Food Restaurant
12,M1S,Scarborough,Agincourt,43.7942,-79.262029,1.0,Skating Rink,Breakfast Spot,Latin American Restaurant,Lounge,Clothing Store,Vietnamese Restaurant,Convenience Store,Gaming Cafe,Furniture / Home Store,Fried Chicken Joint
13,M1T,Scarborough,"Clarks Corners, Tam O'Shanter, Sullivan",43.781638,-79.304302,1.0,Pizza Place,Pharmacy,Gas Station,Chinese Restaurant,Noodle House,Italian Restaurant,Intersection,Convenience Store,Bank,Shopping Mall
14,M1V,Scarborough,"Milliken, Agincourt North, Steeles East, L'Amo...",43.815252,-79.284577,0.0,Playground,Park,Vietnamese Restaurant,Coffee Shop,Gaming Cafe,Furniture / Home Store,Fried Chicken Joint,Fast Food Restaurant,Farm,Electronics Store
15,M1W,Scarborough,"Steeles West, L'Amoreaux West",43.799525,-79.318389,1.0,Chinese Restaurant,Breakfast Spot,Fast Food Restaurant,Cosmetics Shop,Discount Store,Pharmacy,Pizza Place,Coffee Shop,Sandwich Place,Bank


In [77]:
# create map
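# center on the mean Scarborough coordinates rather than the leftover loop variables from In [25]
map_clusters = folium.Map(location=[scarborough_merged['Latitude'].mean(), scarborough_merged['Longitude'].mean()], zoom_start=11)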

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(scarborough_merged['Latitude'], scarborough_merged['Longitude'], scarborough_merged['Neighborhood'], scarborough_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters