<a href="https://colab.research.google.com/github/Mallveguine/Coursera_Capstone/blob/main/Final%20Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto</font></h1>

# Introduction

The similarity of neighborhoods is an important consideration when choosing a business location and in city planning. The availability of large location datasets makes data analysis and visualization possible. Statistical methods and machine learning approaches enable decision making based on quantitative analysis, which is more evidence-informed, robust, objective and comprehensive.

Assume you own a store in one of Toronto's neighborhoods and want to move to a new location. The decision would be based on data, making the best use of the available information.

# Data

This report uses Foursquare location data and postal code information from Wikipedia. Location data enables visualization of the results as well as analysis and interpretation of different venue categories to facilitate decision making. It also allows data to be analyzed on a large scale, within a short time, and at relatively low computational and financial cost.

# Methodology

Python was used for data analysis. K-means clustering was used to group similar neighborhoods. Descriptive statistics such as counts and frequencies were used to explore the data structure and support interpretation. The Folium package was used to visualize the data on an interactive map, which facilitates analysis and makes the data easier to understand.

In [None]:
import numpy as np

import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json

#!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim

import requests
from pandas import json_normalize  # top-level location since pandas 1.0


import matplotlib.cm as cm
import matplotlib.colors as colors


from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes
import folium

print('Libraries imported.')

Libraries imported.


In [None]:
!pip install lxml html5lib beautifulsoup4



In [None]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

df1 = pd.read_html(url)

df1 = df1[0]

# keep only rows with an assigned borough
df1 = df1[df1['Borough'] != 'Not assigned']

df1 = df1.reset_index(drop=True)

df1.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


I was not able to get the geographical coordinates using the Geocoder package, so a CSV file was used instead.

In [None]:
df2 = pd.read_csv('Geospatial_Coordinates.csv', header = 0)

df2.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [None]:
df = pd.merge(df1, df2, how='outer', on='Postal Code')
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


In [None]:
df.shape

(103, 5)

This dataframe has 103 rows and 5 columns.

In [None]:
# get the coordinates of Toronto

address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [None]:
#create a folium map and add markers

map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto

In [None]:
import os

# Read the Foursquare credentials from environment variables instead of hard-coding them.
# The variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are an assumption.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20180605'
LIMIT = 100

print('Credentials loaded.')

Credentials loaded.


**Explore the first neighborhood**

In [None]:
df.loc[0, 'Neighbourhood']

neighborhood_latitude = df.loc[0, 'Latitude']
neighborhood_longitude = df.loc[0, 'Longitude']

neighborhood_name = df.loc[0, 'Neighbourhood']

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))


Latitude and longitude values of Parkwoods are 43.7532586, -79.3296565.


Get the top 100 venues in Parkwoods within a radius of 500 meters.

In [None]:
LIMIT = 100
radius = 500


url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

results = requests.get(url).json()

def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues)


filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]


nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)


nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

2 venues were returned by Foursquare.
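
The request URL above is assembled with a long format string. As a minimal sketch of an alternative, the query string can be built from a dictionary with the standard library's `urlencode`, which also takes care of escaping. The credential values below are hypothetical placeholders, and `params`, `base_url` and `explore_url` are names introduced only for this illustration.

```python
from urllib.parse import urlencode

# Hypothetical placeholder credentials; real values should not be written into the notebook.
params = {
    "client_id": "YOUR_CLIENT_ID",
    "client_secret": "YOUR_CLIENT_SECRET",
    "v": "20180605",
    "ll": "43.7532586,-79.3296565",  # latitude,longitude of Parkwoods from the output above
    "radius": 500,
    "limit": 100,
}

base_url = "https://api.foursquare.com/v2/venues/explore"
explore_url = base_url + "?" + urlencode(params)  # the comma in "ll" is escaped as %2C
print(explore_url)
```

The same dictionary could instead be passed to `requests.get(base_url, params=params)`, which builds the query string itself.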





## Explore Neighborhoods in Toronto

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
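
`getNearbyVenues` reads `v['venue']['categories'][0]['name']` directly, which would raise an `IndexError` for a venue with an empty category list. The earlier `get_category_type` helper guards against that case. A minimal sketch of the same guard for plain venue dictionaries follows. The function name `first_category_name` and the two sample venues are hypothetical and are not part of the Foursquare response.

```python
def first_category_name(venue):
    """Return the name of the venue's first category, or None if it has none."""
    categories = venue.get("categories") or []
    return categories[0]["name"] if categories else None


# Hypothetical items shaped like the 'venue' objects used in the notebook
with_category = {"name": "Example Park", "categories": [{"name": "Park"}]}
without_category = {"name": "Unnamed Spot", "categories": []}

print(first_category_name(with_category))
print(first_category_name(without_category))
```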

In [None]:
toronto_venues = getNearbyVenues(names=df['Neighbourhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

In [None]:
# number of venues returned for each neighborhood
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Agincourt,5,5,5,5,5,5
"Alderwood, Long Branch",7,7,7,7,7,7
"Bathurst Manor, Wilson Heights, Downsview North",21,21,21,21,21,21
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",22,22,22,22,22,22
Berczy Park,55,55,55,55,55,55
"Birch Cliff, Cliffside West",4,4,4,4,4,4
"Brockton, Parkdale Village, Exhibition Place",23,23,23,23,23,23
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",16,16,16,16,16,16
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",16,16,16,16,16,16


In [None]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 273 unique categories.


## Analyze Each Neighborhood

In [None]:
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
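
Taking the mean of one-hot columns per neighborhood gives, for each category, the share of that neighborhood's venues belonging to it. This is why Agincourt, with 5 venues, reports frequencies of 0.2. The sketch below reproduces the arithmetic with only the standard library so that each step is visible. The `venues` list is hypothetical sample data, not Foursquare output.

```python
from collections import Counter, defaultdict

# Hypothetical (neighborhood, venue category) pairs for illustration only
venues = [
    ("A", "Coffee Shop"),
    ("A", "Coffee Shop"),
    ("A", "Park"),
    ("A", "Bank"),
    ("B", "Pizza Place"),
    ("B", "Park"),
]

# Count how often each category occurs in each neighborhood
counts = defaultdict(Counter)
for neighborhood, category in venues:
    counts[neighborhood][category] += 1

# Divide by the number of venues in the neighborhood to get the frequency
frequencies = {
    hood: {cat: n / sum(cats.values()) for cat, n in cats.items()}
    for hood, cats in counts.items()
}
print(frequencies["A"])
```

Categories that never occur in a neighborhood are simply absent here, whereas the one-hot table stores them as 0.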

**The 3 most common venues for each neighborhood**

In [None]:
# each neighborhood and its top 3 most common venues
num_top_venues = 3

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')


----Agincourt----
            venue  freq
0          Lounge   0.2
1  Breakfast Spot   0.2
2    Skating Rink   0.2


----Alderwood, Long Branch----
         venue  freq
0  Pizza Place  0.29
1     Pharmacy  0.14
2          Gym  0.14


----Bathurst Manor, Wilson Heights, Downsview North----
                venue  freq
0         Coffee Shop  0.10
1                Bank  0.10
2  Chinese Restaurant  0.05


----Bayview Village----
                 venue  freq
0  Japanese Restaurant  0.25
1                 Café  0.25
2   Chinese Restaurant  0.25


----Bedford Park, Lawrence Manor East----
                venue  freq
0      Sandwich Place  0.09
1  Italian Restaurant  0.09
2         Coffee Shop  0.09


----Berczy Park----
         venue  freq
0  Coffee Shop  0.09
1     Beer Bar  0.04
2  Cheese Shop  0.04


----Birch Cliff, Cliffside West----
                   venue  freq
0        College Stadium  0.25
1                   Café  0.25
2  General Entertainment  0.25


----Brockton, Parkdale Village,

**The top 10 venues for each neighborhood**

In [56]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)


neighborhoods_venues_sorted.head()


Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Lounge,Skating Rink,Latin American Restaurant,Breakfast Spot,Clothing Store,Drugstore,Discount Store,Distribution Center,Dog Run,Doner Restaurant
1,"Alderwood, Long Branch",Pizza Place,Gym,Pharmacy,Coffee Shop,Sandwich Place,Pub,Women's Store,Dog Run,Dim Sum Restaurant,Diner
2,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Mobile Phone Shop,Bridal Shop,Sandwich Place,Diner,Restaurant,Deli / Bodega,Supermarket,Middle Eastern Restaurant
3,Bayview Village,Café,Japanese Restaurant,Chinese Restaurant,Bank,Women's Store,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop
4,"Bedford Park, Lawrence Manor East",Coffee Shop,Sandwich Place,Italian Restaurant,Greek Restaurant,Sushi Restaurant,Pharmacy,Pizza Place,Pub,Café,Butcher
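
The column names above are built by catching the exception raised when the `indicators` list runs out of suffixes. That works for 10 columns, but it would label a 21st column as "21th". A minimal sketch of an explicit helper follows. The function name `ordinal` is introduced here for illustration and is not part of the notebook.

```python
def ordinal(n):
    """Return n followed by its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the other teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


columns = ["Neighborhood"] + [
    f"{ordinal(i)} Most Common Venue" for i in range(1, 11)
]
print(columns[1], "|", columns[4], "|", columns[10])
```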


## Cluster Neighborhoods

Use k-means to cluster the neighborhoods into 5 clusters.
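
For reference, k-means with $k$ clusters looks for an assignment that minimizes the within-cluster sum of squared distances. The notation below is introduced here only for illustration: $x_i$ is the vector of venue-category frequencies of neighborhood $i$ (one row of `toronto_grouped_clustering`), $C_j$ is the set of neighborhoods assigned to cluster $j$, and $\mu_j$ is the mean of the vectors in $C_j$. In this notebook $k = 5$.

$$
\min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2 , \qquad \mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
$$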

In [None]:
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)
kmeans.labels_[0:10]

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

In [57]:
# a new dataframe that includes the cluster and the top 10 venues for each neighborhood
neighborhoods_venues_sorted.insert(0, 'Clusters labels', kmeans.labels_, allow_duplicates=True)
toronto_merged = df

toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood')
# drop rows with missing values, e.g. areas for which no venues were returned
toronto_merged.dropna(axis=0, inplace=True)
toronto_merged.head()
# select the columns with a list; recent pandas versions do not accept a bare tuple of names here
a = toronto_merged.groupby(['Clusters labels'], sort=True)[['Neighbourhood', 'Clusters labels']]
a.head()

  


Unnamed: 0,Neighbourhood,Clusters labels
0,Parkwoods,0.0
1,Victoria Village,1.0
2,"Regent Park, Harbourfront",1.0
3,"Lawrence Manor, Lawrence Heights",1.0
4,"Queen's Park, Ontario Provincial Government",1.0
6,"Malvern, Rouge",4.0
7,Don Mills,1.0
11,"West Deane Park, Princess Gardens, Martin Grov...",3.0
21,Caledonia-Fairbanks,0.0
35,"East Toronto, Broadview North (Old East York)",0.0


# Results

**Neighborhoods in different clusters**

In [81]:
#first_cluster
first_cluster = toronto_merged.loc[toronto_merged['Clusters labels'] == 0.0, 'Neighbourhood']
print(first_cluster)
len(first_cluster)

0                                             Parkwoods
21                                  Caledonia-Fairbanks
35        East Toronto, Broadview North (Old East York)
49             North Park, Maple Leaf Park, Upwood Park
61                                        Lawrence Park
64                                               Weston
66                                      York Mills West
85    Milliken, Agincourt North, Steeles East, L'Amo...
91                                             Rosedale
Name: Neighbourhood, dtype: object


9

In [83]:
#second_cluster
second_cluster = toronto_merged.loc[toronto_merged['Clusters labels'] == 1.0, 'Neighbourhood']
print(second_cluster.head())
len(second_cluster)

1                               Victoria Village
2                      Regent Park, Harbourfront
3               Lawrence Manor, Lawrence Heights
4    Queen's Park, Ontario Provincial Government
7                                      Don Mills
Name: Neighbourhood, dtype: object


88

In [84]:
#third_cluster
third_cluster = toronto_merged.loc[toronto_merged['Clusters labels'] == 2.0, 'Neighbourhood']
print(third_cluster.head())
len(third_cluster)

45    York Mills, Silver Hills
Name: Neighbourhood, dtype: object


1

In [85]:
#fourth_cluster
fourth_cluster = toronto_merged.loc[toronto_merged['Clusters labels'] == 3.0, 'Neighbourhood']
print(fourth_cluster.head())
len(fourth_cluster)

11    West Deane Park, Princess Gardens, Martin Grov...
Name: Neighbourhood, dtype: object


1

In [86]:
#fifth_cluster
fifth_cluster = toronto_merged.loc[toronto_merged['Clusters labels'] == 4.0, 'Neighbourhood']
print(fifth_cluster.head())
len(fifth_cluster)

6    Malvern, Rouge
Name: Neighbourhood, dtype: object


1

In [87]:
toronto_merged.loc[toronto_merged['Neighbourhood'] == 'York Mills, Silver Hills']

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Clusters labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
45,M2L,North York,"York Mills, Silver Hills",43.75749,-79.374714,2.0,Martial Arts School,Women's Store,Drugstore,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant


The third cluster consists of a single postal-code area that covers two neighborhoods in the same borough, York Mills and Silver Hills. Martial Arts School is its most common venue.

In [88]:
toronto_merged.loc[toronto_merged['Neighbourhood'] == 'West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale']

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Clusters labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
11,M9B,Etobicoke,"West Deane Park, Princess Gardens, Martin Grov...",43.650943,-79.554724,3.0,Print Shop,Women's Store,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Drugstore,Dessert Shop


The fourth cluster consists of a single postal-code area that covers five neighborhoods in the same borough: West Deane Park, Princess Gardens, Martin Grove, Islington and Cloverdale. Print Shop is its most common venue.

In [89]:
toronto_merged.loc[toronto_merged['Neighbourhood'] == 'Malvern, Rouge']

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Clusters labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,4.0,Fast Food Restaurant,Drugstore,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Wings Joint


The fifth cluster consists of a single postal-code area that covers two neighborhoods in the same borough, Malvern and Rouge. Fast Food Restaurant is its most common venue.

The neighborhoods in different clusters are shown above. 

# Discussion

In [None]:
# visualize
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# one color per cluster, evenly spaced along the rainbow colormap
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Clusters labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Most neighborhoods fall into the second cluster (88 of the 100 clustered areas), and three of the five clusters contain only a single postal-code area. This suggests that neighborhoods are highly similar in their venue profiles and that the functional areas of the city are not sharply separated. As the map shows, the second cluster covers most of the city while the other clusters are embedded inside it, with no clear boundaries between clusters.
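
The imbalance between clusters can be summarized directly from the sizes printed in the Results section. The sketch below only restates those counts as shares. The names `cluster_sizes`, `shares` and `singletons` are introduced for illustration.

```python
# Cluster sizes as reported by the len(...) calls in the Results section
cluster_sizes = {0: 9, 1: 88, 2: 1, 3: 1, 4: 1}

total = sum(cluster_sizes.values())
shares = {label: size / total for label, size in cluster_sizes.items()}

for label, share in shares.items():
    print(f"cluster {label}: {cluster_sizes[label]:>2} areas ({share:.0%})")

# Clusters that contain a single postal-code area
singletons = [label for label, size in cluster_sizes.items() if size == 1]
print("single-area clusters:", singletons)
```

A split this uneven may indicate that five clusters are more than the venue data can clearly separate, so the cluster labels are best read as a rough grouping.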

Thus, if a change of location is envisioned, the decision depends on where the shop is currently located. If it is in a neighborhood of the second cluster, moving to most places in the city should not be a problem (taking only venue categories into account). The same could hold if it is in a neighborhood of the first cluster. Moving from York Mills, Silver Hills, West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale, Malvern or Rouge could be more challenging. However, because the clusters are interspersed and relatively close to one another, even a shop in one of the last three clusters could still be within customers' travel range and able to attract customers. Nevertheless, many other factors, including prices, downtown versus uptown location, and subjective considerations, are also of great importance.

# Conclusion

1. How easy a change of location is depends on the cluster of the neighborhood where the shop is currently located.
2. Most neighborhoods are concentrated in the second cluster.
3. Other factors, such as prices and downtown versus uptown location, also play an important part.