#### Author: DIEP, LY BAO LONG
#### Assignment: Segmenting and Clustering Neighborhoods in Toronto.
#### Day: 12-03-2021

In [2]:
import pandas as pd
import numpy as np
import requests
from pandas import json_normalize  # top-level import; the pandas.io.json path is deprecated
from geopy.geocoders import Nominatim
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans

In [11]:
# https://pythonbasics.org/pandas-web-scraping/#:~:text=Pandas%20makes%20it%20easy%20to,Excel%20file%20or%20csv%20file.
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html = pd.read_html(url)
# toronto_df = html[0].dropna(subset = ['Borough'])
# print(f'Shape: {toronto_df.shape[0]}x{toronto_df.shape[1]}')
# toronto_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,M1ANot assigned,M2ANot assigned,M3ANorth York(Parkwoods),M4ANorth York(Victoria Village),M5ADowntown Toronto(Regent Park / Harbourfront),M6ANorth York(Lawrence Manor / Lawrence Heights),M7AQueen's Park(Ontario Provincial Government),M8ANot assigned,M9AEtobicoke(Islington Avenue)
1,M1BScarborough(Malvern / Rouge),M2BNot assigned,M3BNorth York(Don Mills)North,M4BEast York(Parkview Hill / Woodbine Gardens),"M5BDowntown Toronto(Garden District, Ryerson)",M6BNorth York(Glencairn),M7BNot assigned,M8BNot assigned,M9BEtobicoke(West Deane Park / Princess Garden...
2,M1CScarborough(Rouge Hill / Port Union / Highl...,M2CNot assigned,M3CNorth York(Don Mills)South(Flemingdon Park),M4CEast York(Woodbine Heights),M5CDowntown Toronto(St. James Town),M6CYork(Humewood-Cedarvale),M7CNot assigned,M8CNot assigned,M9CEtobicoke(Eringate / Bloordale Gardens / Ol...
3,M1EScarborough(Guildwood / Morningside / West ...,M2ENot assigned,M3ENot assigned,M4EEast Toronto(The Beaches),M5EDowntown Toronto(Berczy Park),M6EYork(Caledonia-Fairbanks),M7ENot assigned,M8ENot assigned,M9ENot assigned
4,M1GScarborough(Woburn),M2GNot assigned,M3GNot assigned,M4GEast York(Leaside),M5GDowntown Toronto(Central Bay Street),M6GDowntown Toronto(Christie),M7GNot assigned,M8GNot assigned,M9GNot assigned
5,M1HScarborough(Cedarbrae),M2HNorth York(Hillcrest Village),M3HNorth York(Bathurst Manor / Wilson Heights ...,M4HEast York(Thorncliffe Park),M5HDowntown Toronto(Richmond / Adelaide / King),M6HWest Toronto(Dufferin / Dovercourt Village),M7HNot assigned,M8HNot assigned,M9HNot assigned
6,M1JScarborough(Scarborough Village),M2JNorth York(Fairview / Henry Farm / Oriole),M3JNorth York(Northwood Park / York University),M4JEast YorkEast Toronto(The Danforth East),M5JDowntown Toronto(Harbourfront East / Union ...,M6JWest Toronto(Little Portugal / Trinity),M7JNot assigned,M8JNot assigned,M9JNot assigned
7,M1KScarborough(Kennedy Park / Ionview / East B...,M2KNorth York(Bayview Village),M3KNorth York(Downsview)East (CFB Toronto),M4KEast Toronto(The Danforth West / Riverdale),M5KDowntown Toronto(Toronto Dominion Centre / ...,M6KWest Toronto(Brockton / Parkdale Village / ...,M7KNot assigned,M8KNot assigned,M9KNot assigned
8,M1LScarborough(Golden Mile / Clairlea / Oakridge),M2LNorth York(York Mills / Silver Hills),M3LNorth York(Downsview)West,M4LEast Toronto(India Bazaar / The Beaches West),M5LDowntown Toronto(Commerce Court / Victoria ...,M6LNorth York(North Park / Maple Leaf Park / U...,M7LNot assigned,M8LNot assigned,M9LNorth York(Humber Summit)
9,M1MScarborough(Cliffside / Cliffcrest / Scarbo...,M2MNorth York(Willowdale / Newtonbrook),M3MNorth York(Downsview)Central,M4MEast Toronto(Studio District),M5MNorth York(Bedford Park / Lawrence Manor East),M6MYork(Del Ray / Mount Dennis / Keelsdale and...,M7MNot assigned,M8MNot assigned,M9MNorth York(Humberlea / Emery)


#### PERSONAL EVALUATION: First Look at Dataset
- To scrape a table from a webpage into a pandas DataFrame, we can simply use the pandas function (read_html).
- Since many of the Borough and Neighbourhood entries are "Not assigned", we can convert these values to missing values (NaN) at the very beginning, as soon as the data is scraped from Wikipedia.
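
The scraped table above is a grid in which each cell fuses the postal code, the borough, and the neighbourhoods. The following is a minimal sketch of how such cells could be split and how "Not assigned" could be turned into NaN. The regular expression, the helper name `parse_cell`, and the variable `sample_df` are assumptions made for illustration; cells with extra text after the closing parenthesis (for example "M3BNorth York(Don Mills)North") are only partly handled, because the trailing text is ignored.

```python
import re

import numpy as np
import pandas as pd

# A few cells copied from the scraped grid; each one fuses the postal code,
# the borough, and the neighbourhoods in parentheses.
sample_cells = [
    "M1ANot assigned",
    "M3ANorth York(Parkwoods)",
    "M5ADowntown Toronto(Regent Park / Harbourfront)",
    "M7AQueen's Park(Ontario Provincial Government)",
]

# Assumed cell layout: <letter digit letter><borough>(<neighbourhoods>)
CELL_PATTERN = re.compile(
    r"^(?P<code>[A-Z]\d[A-Z])(?P<borough>[^(]+)(?:\((?P<hood>[^)]*)\))?"
)


def parse_cell(cell):
    """Split one grid cell into postal code, borough, and neighbourhood."""
    match = CELL_PATTERN.match(cell.strip())
    if match is None:
        return None
    borough = match.group("borough").strip()
    hood = match.group("hood")
    return {
        "Postal Code": match.group("code"),
        # Treat 'Not assigned' as a missing value right away
        "Borough": np.nan if borough == "Not assigned" else borough,
        "Neighbourhood": hood.replace(" / ", ", ") if hood else np.nan,
    }


parsed = [parse_cell(cell) for cell in sample_cells]
sample_df = pd.DataFrame([row for row in parsed if row is not None])
# Drop the postal codes that have no borough assigned
sample_df = sample_df.dropna(subset=["Borough"]).reset_index(drop=True)
print(sample_df)
```

The column names match the ones used in the rest of the notebook (Postal Code, Borough, Neighbourhood), so a frame built this way could be merged with the coordinates file in the same manner.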

In [None]:
latlng_df = pd.read_csv('Geospatial_Coordinates.csv')
toronto_df = pd.merge(toronto_df, latlng_df, on = 'Postal Code')
toronto_df.head()

#### Let's take a visual look at all of the boroughs in Toronto and their neighborhoods

In [None]:
address = 'Toronto, ON'
geolocator = Nominatim(user_agent = 'to_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(f'Toronto\nLat: {latitude}\nLng: {longitude}')

In [None]:
map_toronto = folium.Map([latitude,longitude], zoom_start = 13)
for lat,lng, borough, neighbourhood in zip(toronto_df.Latitude, toronto_df.Longitude, toronto_df.Borough, toronto_df.Neighbourhood):
    label = f'Neighborhood: {neighbourhood}\nBorough: {borough}'
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup = label,
        color = 'red',
        fill = True,
        fill_color = 'yellow',
        fill_opacity = 0.7
    ).add_to(map_toronto)
map_toronto

In [None]:
borough_df = toronto_df.groupby('Borough').count().sort_values('Neighbourhood',ascending = False)
borough_df

#### Choice of borough
- Looking at the borough_df DataFrame above, which is grouped by the Borough column, we can see that North York looks like the most promising borough: it has as many as 24 neighborhoods with distinct Postal Codes, which stands out among the boroughs.
- By examining the neighborhoods of the North York region, we can explore the variety of venues that surround them.
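
As a small illustration of the comparison described above, the sketch below counts distinct postal codes per borough and picks the largest one. The rows are a hypothetical sample (postal codes and boroughs copied from the scraped grid), and the names `sample`, `codes_per_borough`, and `top_borough` are made up for this example.

```python
import pandas as pd

# Hypothetical sample rows (postal codes and boroughs taken from the grid above)
sample = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A", "M6A", "M9A", "M1B"],
    "Borough": ["North York", "North York", "Downtown Toronto",
                "North York", "Etobicoke", "Scarborough"],
})

# Count the distinct postal codes in each borough, largest first
codes_per_borough = (
    sample.groupby("Borough")["Postal Code"]
    .nunique()
    .sort_values(ascending=False)
)
top_borough = codes_per_borough.idxmax()
print(codes_per_borough)
print(f"Borough with the most postal codes: {top_borough}")
```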

#### Get the Toronto boroughs DataFrame from the whole dataset

In [None]:
to_df = toronto_df[toronto_df['Borough'].str.contains('Toronto')].reset_index(drop = True)
# address = 'North York, ON'
# geolocator = Nominatim(user_agent='to_explorer')
# location = geolocator.geocode(address)
# ny_lat = location.latitude
# ny_lng = location.longitude
# print(f'North York\nLatitude: {ny_lat}\nLongitude: {ny_lng}')
print(f'Shape: {to_df.shape[0]}x{to_df.shape[1]}')
to_df.head()

In [None]:
# Check for repeated values in the Neighbourhood column
to_df.Neighbourhood.value_counts()

In [None]:
map_to = folium.Map([latitude,longitude],zoom_start = 11)
for lat, lng, neigh, pc in zip(to_df.Latitude, to_df.Longitude, to_df.Neighbourhood, to_df['Postal Code']):
    label = f'Neighbourhood: {neigh}\nPostal Code: {pc}'
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat,lng],
        radius = 5,
        popup = label,
        color = 'blue',
        fill = True,
        fill_color = 'red',
        fill_opacity = 0.7
    ).add_to(map_to)
map_to

#### Using Foursquare to explore the neighborhoods

In [None]:
import os

# Read the Foursquare credentials from environment variables instead of hard-coding them.
# The variable names below are an assumed convention; set them before running this cell.
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '') # your Foursquare ID
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '') # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
ACCESS_TOKEN = os.environ.get('FOURSQUARE_ACCESS_TOKEN', '')
# Confirm that the credentials are set without printing the secrets themselves
print('Credentials loaded:', all([CLIENT_ID, CLIENT_SECRET, ACCESS_TOKEN]))

#### Instruction
- The function below retrieves the venues around each neighborhood, based on its latitude and longitude, through the explore URL provided by Foursquare.

In [None]:
def getNearbyVenues(pc, names, latitude, longitude, radius = 500):
    venues_list = []
    for pc, names, lat, lng in zip(pc, names, latitude, longitude):
        print(names)
        
        url = f'https://api.foursquare.com/v2/venues/explore?&client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}&v={VERSION}&ll={lat},{lng}&oauth_token={ACCESS_TOKEN}&radius={radius}&limit={LIMIT}'
        results = requests.get(url).json()["response"]['groups'][0]['items']
        venues_list.append([(
            pc,
            names,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postal Code',
            'Neighborhood', 
              'Neighborhood Latitude', 
              'Neighborhood Longitude', 
              'Venue', 
              'Venue Latitude', 
              'Venue Longitude', 
              'Venue Category']
    return(nearby_venues)
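
The function above indexes straight into the JSON response. The sketch below shows the same response layout (response, groups, items, venue) on a mock payload, with guards for an empty response and for a venue without a category. The helper name `extract_venues` and the mock data are hypothetical, and the possibility of an empty category list is an assumption, not something this notebook demonstrates.

```python
def extract_venues(payload):
    """Flatten an explore-style payload into (name, lat, lng, category) tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # Assumption: a venue may come back with no category; use None then
        category = categories[0]["name"] if categories else None
        rows.append((
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Mock payload that mirrors the keys used by getNearbyVenues (values are made up)
mock_payload = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Example Cafe",
                           "location": {"lat": 43.65, "lng": -79.38},
                           "categories": [{"name": "Coffee Shop"}]}},
                {"venue": {"name": "Unnamed Spot",
                           "location": {"lat": 43.66, "lng": -79.39},
                           "categories": []}},
            ]
        }]
    }
}

venue_rows = extract_venues(mock_payload)
print(venue_rows)
```

Working on a mock payload keeps the parsing logic testable without spending API quota.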

In [None]:
to_venues = getNearbyVenues(to_df['Postal Code'],to_df.Neighbourhood, to_df.Latitude, to_df.Longitude)
to_venues.head()

In [None]:
# Let's have an overall observation on the total number of Venues of each Postal Code 
to_venues.groupby('Postal Code').count().sort_values('Venue',ascending=False).loc[:,'Venue':'Venue Category']

In [None]:
print(f'Number of unique categories: {len(to_venues["Venue Category"].unique())}')

#### One-hot Encoding technique
- Now we will use one-hot encoding to turn the venue categories found for each Postal Code into indicator columns (1 if the venue belongs to the category, otherwise 0).
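
A minimal sketch of one-hot encoding on a toy table is given here. The venue rows are hypothetical, and `dtype=int` is passed so the indicators show up as 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical venues: two in M5A and one in M5B
venues = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M5B"],
    "Venue Category": ["Cafe", "Park", "Cafe"],
})

# One indicator column per category, one row per venue
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
# Put the key column back in front so each row can still be traced
onehot.insert(0, "Postal Code", venues["Postal Code"])
print(onehot)
```

Each row still describes a single venue; the per-postal-code summary only appears once these rows are grouped.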

In [None]:
to_onehot = pd.get_dummies(to_venues[['Venue Category']],prefix="",prefix_sep="")
to_onehot[['Postal Code', 'Neighborhood']] = to_venues[['Postal Code','Neighborhood']]
fixed_cols = list(to_onehot.columns[-2:]) + list(to_onehot.columns[:-2])
to_onehot = to_onehot[fixed_cols]
to_onehot.head()

In [None]:
to_onehot.shape

#### PERSONAL EVALUATION: Differences in the dataset
- Briefly glancing at the dataset, it is noticeable that the Neighborhood column contains a number of repeated values that have different Postal Codes. Hence, the Postal Code column can be used as the key feature when exploring and clustering the neighborhoods in Toronto.
- After adding the Latitude and Longitude columns based on the Postal Code, we can see that although several values in the Neighborhood column are repeated, these neighborhoods have entirely distinct Postal Codes, Latitudes, and Longitudes.
- In conclusion, if we group the dataset by Neighborhood, many potential venues can be merged together and missed during exploring and clustering. Therefore, the Postal Code should be used as the grouping key because it is unique.
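
The effect of the grouping key can be seen on a toy example. Below, two postal codes share one neighborhood name; grouping by Neighborhood blends them into one row, while grouping by Postal Code keeps them apart. The postal codes and the name come from the scraped grid, but the venue indicators are hypothetical.

```python
import pandas as pd

# Hypothetical one-hot rows: M3K has a cafe and a park, M3L only a park
onehot = pd.DataFrame({
    "Postal Code": ["M3K", "M3K", "M3L"],
    "Neighborhood": ["Downsview", "Downsview", "Downsview"],
    "Cafe": [1, 0, 0],
    "Park": [0, 1, 1],
})
categories = ["Cafe", "Park"]

# Grouping by name merges both postal codes into one averaged profile
by_hood = onehot.groupby("Neighborhood")[categories].mean().reset_index()

# Grouping by postal code keeps one frequency profile per area
by_code = (
    onehot.groupby(["Postal Code", "Neighborhood"])[categories]
    .mean()
    .reset_index()
)
print(by_hood)
print(by_code)
```

The mean of a 0/1 indicator is the share of venues in that category, which is why the grouped values read as frequencies.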

In [None]:
to_grouped = to_onehot.groupby(['Postal Code','Neighborhood']).mean().reset_index()
to_grouped
# ny_onehot.groupby('Postal Code')

In [None]:
# Print the most frequent venue categories for each postal code
num_top_venues = 8
for pc,hood in zip(to_grouped['Postal Code'],to_grouped['Neighborhood']):
    print("----"+f'{hood} (PC:{pc})'+"----")
    temp = to_grouped[to_grouped['Postal Code'] == pc].T.reset_index() # transpose the index and the columns, then get the corresponding neighborhood and its columns 
    temp.columns = ['venue','freq']
    temp = temp.iloc[2:].copy()  # skip the two key rows; copy so the assignment below is safe
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

In [None]:
# Illustration of the loop above (this cell is only a reference for the coder and can be skipped)
# ny_grouped[ny_grouped['Postal Code'] == 'M2H'].T

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[2:]
    row_categories_sorted = row_categories.sort_values(ascending = False)
    return row_categories_sorted.index.values[0:num_top_venues]

# # Explanation of the above function (this block is only a reference for the coder and can be skipped)
# row = to_grouped.iloc[0,:] # get the first row and all columns
# row_categories = row.iloc[2:] # only take the row with numeric values
# row_categories_sorted = row_categories.sort_values(ascending = False)
# row_categories_sorted

#### Clustering the neighborhoods
- Looking at the frequency of each venue category, we see that a number of neighborhoods have only a few venues.
- Hence, to make sure every neighborhood has venues to report, we will take only the top 4 most common venues of each one.
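
The ranking step can be sketched on a toy frequency table as follows. The frequencies are hypothetical and chosen without ties, `top_n` is set to 2 to keep the output small, and the `ordinal` helper is an illustrative addition for building the column names.

```python
import pandas as pd


def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 3rd, 4th."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


# Hypothetical category frequencies per postal code
freqs = pd.DataFrame(
    {"Cafe": [0.5, 0.1], "Park": [0.3, 0.2], "Gym": [0.2, 0.7]},
    index=pd.Index(["M5A", "M5B"], name="Postal Code"),
)

top_n = 2
labels = [f"{ordinal(i + 1)} Most Common Venue" for i in range(top_n)]

rows = []
for code, row in freqs.iterrows():
    # Sort this postal code's categories from most to least frequent
    top = row.sort_values(ascending=False).index[:top_n]
    rows.append([code, *top])

top_venues = pd.DataFrame(rows, columns=["Postal Code", *labels])
print(top_venues)
```

When frequencies tie, the order among the tied categories is not guaranteed, so ranks near a tie should be read loosely.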

In [None]:
num_top_venues = 4
indicators = ['st','nd','rd']
columns = ['Postal Code','Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append(f'{ind+1}{indicators[ind]} Most Common Venue')
    except IndexError:
        columns.append(f'{ind+1}th Most Common Venue')

neigh_venues_sorted = pd.DataFrame(columns = columns)
neigh_venues_sorted[['Postal Code','Neighborhood']] = to_grouped[['Postal Code','Neighborhood']]
for ind in np.arange(to_grouped.shape[0]):
#     neigh_venues_sorted.iloc[0,0:] 
    neigh_venues_sorted.iloc[ind,2:] = return_most_common_venues(to_grouped.iloc[ind,:],num_top_venues)
neigh_venues_sorted

In [None]:
# Based on the frequency of venue categories, we cluster the neighborhoods into 3 distinct groups
kclusters = 3
to_grouped_clustering = to_grouped.drop(columns = ['Postal Code','Neighborhood'])
kmeans = KMeans(n_clusters = kclusters, random_state = 0).fit(to_grouped_clustering)
print(f'Number of Cluster Labels: {len(kmeans.labels_)}')
print(f'Labels: {kmeans.labels_}')

In [None]:
neigh_venues_sorted.insert(0,'Cluster Labels',kmeans.labels_)
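to_merged = to_df.join(neigh_venues_sorted.set_index('Postal Code'), on = 'Postal Code')
# Postal codes with no returned venues get no cluster label; drop them so the labels stay integers
to_merged = to_merged.dropna(subset = ['Cluster Labels'])
to_merged['Cluster Labels'] = to_merged['Cluster Labels'].astype(int)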
columns = [to_merged.columns[1]] + [to_merged.columns[0]] + list(to_merged.columns[2:])  
to_merged = to_merged[columns]
to_merged

In [None]:
map_clusters = folium.Map(location = [latitude,longitude], zoom_start = 11)
# create color code for each of the distinct clusters
x = np.arange(kclusters)
ys = [i + x +(i*x)**2 for i in range (kclusters)]
colors_array = cm.rainbow(np.linspace(0,1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array] 

for lat, lng, poi, cluster in zip(to_merged.Latitude, to_merged.Longitude, to_merged.Neighbourhood, to_merged['Cluster Labels']):
    label = folium.Popup(f'{str(poi)}\nCluster: {str(cluster)}', parse_html = True)
    folium.CircleMarker(
        [lat,lng],
        radius = 5,
        popup = label,
        color = rainbow[cluster],
        fill = True,
        fill_color = rainbow[cluster],
        fill_opacity = 0.7
    ).add_to(map_clusters)
map_clusters
# rainbow[3]

#### Examine the cluster
- Visually, cluster 2 (in red) clearly dominates the map: it covers more neighborhoods than the other clusters.
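
The visual impression can be checked by counting how many postal codes fall in each cluster. The sketch below uses a hypothetical list of labels standing in for kmeans.labels_; the names `example_labels`, `cluster_sizes`, and `cluster_share` are made up for this illustration.

```python
import pandas as pd

# Hypothetical labels, standing in for kmeans.labels_
example_labels = [2, 2, 0, 2, 1, 2, 2, 0]

# Number of postal codes in each cluster, ordered by cluster label
cluster_sizes = pd.Series(example_labels).value_counts().sort_index()
# Share of all postal codes that each cluster holds
cluster_share = cluster_sizes / cluster_sizes.sum()
print(cluster_sizes)
print(cluster_share)
```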

In [None]:
to_merged.loc[to_merged['Cluster Labels'] == 0, to_merged.columns[[1] + list(range(5, to_merged.shape[1]))]]

In [None]:
to_merged.loc[to_merged['Cluster Labels'] == 1, to_merged.columns[[1] + list(range(5, to_merged.shape[1]))]]

In [None]:
to_merged.loc[to_merged['Cluster Labels'] == 2, to_merged.columns[[1] + list(range(5, to_merged.shape[1]))]]

- It is notable that there is a wide variety of venue categories in cluster 2.
- The neighborhoods in cluster 2 offer more kinds of common venues. If we choose these neighborhoods as our destination, we will likely have more options for our trip.
- In contrast, clusters 0 and 1 are not diverse in venue categories, which may mean poorer amenities.
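
One way to put a number on this diversity is to count the distinct categories that appear among the most common venues of each cluster. The sketch below does so on a hypothetical merged table with two ranking columns; the data and the name `diversity` are illustrative only.

```python
import pandas as pd

# Hypothetical merged table: one row per postal code
merged = pd.DataFrame({
    "Cluster Labels": [0, 1, 2, 2],
    "1st Most Common Venue": ["Park", "Garden", "Cafe", "Bar"],
    "2nd Most Common Venue": ["Trail", "Park", "Gym", "Cafe"],
})

# Stack the ranking columns into one long column of categories
long_form = merged.melt(id_vars="Cluster Labels", value_name="Category")
# Count the distinct categories seen in each cluster
diversity = long_form.groupby("Cluster Labels")["Category"].nunique()
print(diversity)
```

A cluster with more postal codes has more chances to show distinct categories, so this count is best read together with the cluster sizes.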