# Segmenting and Clustering Neighborhoods of Toronto

### 1. Fetching data to define neighborhood

To cluster the neighborhoods of the city of Toronto, we first need to define their geospatial locations and boundaries.  We can obtain this data from `geopy`, but before we can do that we need to know more about each neighborhood, such as its name and postal code.

We can get this by scraping the necessary data from a website.  The information we need is available at `https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M`

In [None]:
#!conda install -c conda-forge beautifulsoup4  # remove leading hashtag if beautifulsoup is not installed.

from bs4 import BeautifulSoup # the beautiful soup library will be used to scrape the data from wikipedia.

#!conda install -c conda-forge lxml # remove leading hashtag if lxml parser is not installed

#!conda install -c conda-forge requests # remove leading hashtag if requests library is not installed
import requests
import csv


In [None]:
# The contents of the webpage are fetched and stored in the variable 'source'.  'source' is passed into Beautiful Soup and parsed to return the Beautiful Soup object.

source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')

# To work in the portion of the parse tree that we are concerned with, the contents of the table containing the data of interest is stored in the variable 'table'.
table = soup.table.tbody 

# A csv file is created so that the table contents can be written to it.  newline='' lets the csv module control the line endings.
csv_file = open('toronto_hood.csv', 'w', newline='')
csv_writer = csv.writer(csv_file)

# Each row of the table can be looped through and written to the csv file 'toronto_hood.csv'.
for table_row in table.find_all('tr'):
    field_one = table_row.next_element.next_element
    column_one = field_one.text
    
    field_two = field_one.next_sibling.next_sibling
    column_two = field_two.text

    # The trailing '\n' line ending must be removed from the neighbourhood, and 'Not assigned' values changed to NaN so that pandas can recognize them.
    field_three = field_two.next_sibling.next_sibling
    column_three = field_three.text
    column_three = column_three.rstrip('\n')
    
    if column_three == "Not assigned":
        column_three = 'NaN'

    csv_writer.writerow([column_one, column_two, column_three])

csv_file.close()
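
The sibling-hopping traversal above depends on the exact whitespace between the table cells.  As a more tolerant alternative, the sketch below collects the cells of each row with `find_all` and strips the surrounding whitespace.  The HTML snippet is a made-up stand-in for the Wikipedia table, and the built-in `html.parser` is used only so that the example runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the postal-code table (not the real page contents).
html = """
<table><tbody>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, 'html.parser')

rows = []
for table_row in soup.table.find_all('tr'):
    # Header cells are <th> and data cells are <td>; take both in document order.
    cells = [cell.get_text(strip=True) for cell in table_row.find_all(['th', 'td'])]
    # Mirror the notebook: only the neighbourhood column is converted to NaN.
    if cells[2] == 'Not assigned':
        cells[2] = 'NaN'
    rows.append(cells)

print(rows)
```

Each entry of `rows` could then be passed to `csv_writer.writerow` exactly as in the cell above.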

In [None]:
# import libraries

import pandas as pd

In [None]:
# The dataframe is created from the csv.
df = pd.read_csv('toronto_hood.csv')

df.head()

In [None]:
# The rows containing NaN are recognized by pandas and can be dropped and the index reset.

df.dropna(axis=0, inplace=True)
df.reset_index(inplace=True, drop=True)
df.head()

In [None]:
# The neighbourhoods are grouped and concatenated into a pandas Series for each postcode and assigned to a variable.
build_list = df.groupby('Postcode')['Neighbourhood'].apply(', '.join)

# A vector containing the distinct postcodes is created so that it can be looped through to write the new values for 'Neighbourhood' into 'df'.
pc_frame = df['Postcode'].unique()

In [None]:
# By iterating through the postcodes in 'pc_frame', the concatenated neighbourhoods can be written into the 'Neighbourhood' column of the original dataframe.

for postcode in pc_frame:
    # A single .loc call avoids chained assignment, which may not modify 'df' in place.
    df.loc[df['Postcode'] == postcode, 'Neighbourhood'] = build_list[postcode]


In [None]:
# As each postcode had one row for each neighbourhood belonging to it, the for-loop above produced many duplicate rows.  These duplicates are dropped to obtain the final dataframe.
df.drop_duplicates(inplace=True)
df.reset_index(inplace=True, drop=True)
print(df.shape)

df.tail()
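
The group, write-back, and de-duplicate steps in the cells above can also be expressed as a single aggregation.  The following is a minimal sketch on made-up rows rather than the scraped table; `sample` and `combined` are illustrative names that do not appear elsewhere in the notebook.

```python
import pandas as pd

# Made-up rows in the same shape as the scraped table.
sample = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M6A', 'M6A', 'M7A'],
    'Borough': ['Downtown', 'Downtown', 'North', 'North', 'Central'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Heights', 'Manor', 'Park'],
})

# One row per postcode: the neighbourhood names are joined with ', '.
combined = (
    sample.groupby(['Postcode', 'Borough'])['Neighbourhood']
          .agg(', '.join)
          .reset_index()
)

print(combined)
```

Grouping on both `Postcode` and `Borough` keeps the borough column in the result, under the assumption that each postcode belongs to exactly one borough.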

### 2. Adding longitude and latitude data to dataframe

There was an issue in trying to obtain the geocoder library from Anaconda.  However, the geospatial data for Toronto is available from another source.

In [None]:
path = 'https://cocl.us/Geospatial_data'

lat_long_df = pd.read_csv(path, index_col=False)
lat_long_df.rename(columns={'Postal Code': 'Postcode'}, inplace=True)
lat_long_df.head()

In [None]:
toronto_data = df.merge(lat_long_df)

toronto_data.head()
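
When called without further arguments, `merge` joins on the column names the two frames share, which here is `Postcode`.  A sketch of a more explicit variant on made-up frames follows; `on`, `how`, `validate`, and `indicator` are standard `DataFrame.merge` options, used here to surface any postcode that has no coordinates.  The frame names and values are invented for illustration.

```python
import pandas as pd

# Made-up stand-ins for the neighbourhood table and the coordinate table.
hoods = pd.DataFrame({'Postcode': ['M1B', 'M1C', 'M9Z'],
                      'Neighbourhood': ['A', 'B', 'C']})
coords = pd.DataFrame({'Postcode': ['M1B', 'M1C'],
                       'Latitude': [43.81, 43.78],
                       'Longitude': [-79.19, -79.16]})

# A left join keeps every neighbourhood row; validate raises if a postcode repeats on either side.
merged = hoods.merge(coords, on='Postcode', how='left',
                     validate='one_to_one', indicator=True)

# Rows marked 'left_only' found no matching coordinates.
missing = merged.loc[merged['_merge'] == 'left_only', 'Postcode'].tolist()
print(missing)
```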

### 3. Map of Toronto with Neighbourhoods

To dive deeper into the data, additional libraries are necessary for data manipulation and plotting.

In [None]:
import seaborn as sns
import folium
import numpy as np

import matplotlib.cm as cm
import matplotlib.colors as colors

from pandas import json_normalize  # top-level location since pandas 1.0; the pandas.io.json path is deprecated

To begin, the map of Toronto is created.  As a starting point, the Downtown area will serve as the center of the map.

In [None]:
start_lat = toronto_data.at[2,'Latitude']
start_long = toronto_data.at[2, 'Longitude']

map_toronto = folium.Map(location=[start_lat, start_long], zoom_start=11)

# add the markers for each postcode
for lat, lng, borough, neighborhood in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Borough'], toronto_data['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

With the map ready, it's now time to call the Foursquare API.

In [None]:
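# @hidden_cell
# The credentials are read from environment variables (the variable names are an assumption) so that they are not stored in the notebook.
import os

CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20190710'
LIMIT = 120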

The resulting JSON is parsed to collect the venues for each neighborhood group.

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
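
The request URL in `getNearbyVenues` is assembled by string formatting.  As a sketch of an alternative, the query string can be built from a dictionary with the standard library's `urllib.parse.urlencode`, which also escapes special characters.  The credentials below are placeholders and `build_explore_url` is a hypothetical helper, not part of the notebook; the endpoint and parameter names are the ones already used above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=120):
    """Return the explore URL for one location (hypothetical helper)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # urlencode percent-escapes the values, e.g. the comma inside 'll'.
    return EXPLORE_ENDPOINT + '?' + urlencode(params)

# Placeholder credentials; no request is sent here.
url = build_explore_url('PLACEHOLDER_ID', 'PLACEHOLDER_SECRET', '20190710', 43.65, -79.38)
query = parse_qs(urlsplit(url).query)
print(url)
```

With `requests`, the same dictionary can instead be passed directly through the `params` argument of `requests.get`.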

In [None]:
toronto_venues = getNearbyVenues(names=toronto_data['Neighbourhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )

In [None]:
venue_count = toronto_venues.shape[0]
print(venue_count)

toronto_venues.head()

Our resulting data contains up to 120 venues found within 500 meters of the center of each of the 102 postcode areas.  Neighbourhoods in larger postcodes may therefore have no representation in the data at all.  Without neighbourhood-specific longitude and latitude data, this cannot be explored or confirmed.  The opposite of this "scarcity" issue may exist as well: areas with a higher density of both population and venues are likely to be covered by "smaller" postcodes, so a venue may lie within 500 meters of the center of more than one postcode.  This can be evaluated by checking for duplicate venues within the data.

In [None]:
toronto_venues.duplicated().value_counts()

The value counts return `False` for every row, which means that no row is duplicated in the data fetched from the source.  In an attempt to force duplicate venues to occur, the data was fetched again with LIMIT increased from 120 to 250 and the radius increased from 500 to 800 meters.  The result consisted of the same number of venues.  This would suggest that all venues available for these locations have been captured.
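
One caveat: `duplicated()` with no arguments compares whole rows, and the neighborhood columns differ whenever the same venue is returned for two postcodes, so such overlaps would still be reported as `False`.  A sketch of a check restricted to the venue columns, on made-up rows, might look like this:

```python
import pandas as pd

# Made-up rows: 'Corner Cafe' is returned for two different neighborhoods.
venues = pd.DataFrame({
    'Neighborhood': ['Hood A', 'Hood B', 'Hood B'],
    'Venue': ['Corner Cafe', 'Corner Cafe', 'City Gym'],
    'Venue Latitude': [43.650, 43.650, 43.700],
    'Venue Longitude': [-79.380, -79.380, -79.400],
    'Venue Category': ['Cafe', 'Cafe', 'Gym'],
})

# Whole-row comparison: the differing 'Neighborhood' values hide the overlap.
full_row_dupes = int(venues.duplicated().sum())

# Comparison on the venue's own name and coordinates only.
venue_dupes = int(venues.duplicated(subset=['Venue', 'Venue Latitude', 'Venue Longitude']).sum())

print(full_row_dupes, venue_dupes)
```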

In [None]:
toronto_venues.groupby('Neighborhood').count()

Seeing as we are healthy eaters, we are not interested in fast food restaurants.  These will be removed so as to not impact the clustering results.

In [None]:
# Keep every venue that is not a fast food restaurant, then renumber the rows.
toronto_venues = toronto_venues[toronto_venues['Venue Category'] != 'Fast Food Restaurant'].reset_index(drop=True)

toronto_venues.head()

In [None]:
print(toronto_venues.shape)

In [None]:
# one hot encoding
# The line below encodes only the venue categories.  The neighborhoods are dropped and need to be added back in.
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood']

# The assignment above places 'Neighborhood' at the end of the frame, or overwrites a venue category of the same name if one exists.  Either way, the column is moved to the front.
fixed_columns = ['Neighborhood'] + [col for col in toronto_onehot.columns if col != 'Neighborhood']
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head(12)

In [None]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').sum().reset_index()
toronto_grouped.head()

fun_hoods = toronto_grouped.shape[0]

print(fun_hoods)
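
To make the two steps above concrete, here is a small sketch on made-up venues showing how one-hot encoding followed by a grouped sum yields one row of category counts per neighborhood.  The data and variable names are illustrative only.

```python
import pandas as pd

# Made-up venues: three in 'Hood A', one in 'Hood B'.
venues = pd.DataFrame({
    'Neighborhood': ['Hood A', 'Hood A', 'Hood A', 'Hood B'],
    'Venue Category': ['Cafe', 'Cafe', 'Park', 'Gym'],
})

# One indicator column per category, one row per venue.
onehot = pd.get_dummies(venues['Venue Category'])

# Put the neighborhood name in front (this toy data has no category called 'Neighborhood').
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# Summing the indicators within each neighborhood gives the venue counts per category.
counts = onehot.groupby('Neighborhood').sum().reset_index()
print(counts)
```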

Before ranking which venue types occur most often in each neighborhood, any neighborhood with fewer than 8 venues will be dropped.  For instance, if we want to cluster neighborhoods by the top 5 most frequently occurring types of venue but there are only 3 venues in a neighborhood, the 4th and 5th most frequent venue types will have zero venues of that type in the target neighborhood.

In [None]:
# Total venues per neighborhood; the 'Neighborhood' name column is excluded from the sum.
venue_totals = toronto_grouped.drop(columns='Neighborhood').sum(axis=1)
toronto_grouped = toronto_grouped[venue_totals >= 8].reset_index(drop=True)

toronto_grouped.head()

In [None]:
def return_most_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns for number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Venue'.format(ind+1))

neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_venues(toronto_grouped.iloc[ind, :], num_top_venues)

print(neighborhoods_venues_sorted.shape)
neighborhoods_venues_sorted.head()
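
A toy example may help show what `return_most_venues` does with a single row, and why sparse neighborhoods are a problem: once the categories that are actually present run out, zero-count categories fill the remaining ranks.  The Series below is made up, and `top_categories` is an illustrative restatement of the function above, not a replacement for it.

```python
import pandas as pd

# Made-up row of category counts for one neighborhood.
row = pd.Series({'Neighborhood': 'Hood A', 'Cafe': 3, 'Gym': 0, 'Park': 1, 'Pub': 2})

def top_categories(row, num_top_venues):
    """Return the category names with the highest counts, largest first."""
    counts = row.iloc[1:].astype(int)   # skip the 'Neighborhood' name
    return counts.sort_values(ascending=False).index[:num_top_venues].tolist()

top_three = top_categories(row, 3)
top_four = top_categories(row, 4)

# The fourth rank is taken by 'Gym' even though its count is zero.
print(top_three, top_four)
```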

### 4. Clustering of Neighborhoods

In [None]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each of first 10 rows
kmeans.labels_[0:10]
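
As a small illustration of what the fit produces, the sketch below runs k-means on a made-up count matrix with two obvious groups.  Cluster numbers are arbitrary identifiers, so only the grouping is meaningful, not the specific label values.  `n_init=10` is set explicitly here as an assumption, to keep the behavior the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up category counts: rows are neighborhoods, columns are venue categories.
counts = np.array([
    [5, 4, 0, 0],
    [6, 5, 0, 1],
    [0, 0, 7, 5],
    [0, 1, 6, 6],
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(counts)
labels = model.labels_

# The first two rows share one label and the last two rows share the other.
print(labels)
```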

In [None]:
# add clusters labels to the dataframe
neighborhoods_venues_sorted.insert(0, 'Cluster Label', kmeans.labels_)

toronto_merged = toronto_data

# merge toronto_grouped with toronto_data to add latitude/longitude
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), how='right', on='Neighbourhood')

toronto_merged.head()

In [None]:
toronto_merged.tail()

In [None]:
toronto_merged.shape

In [None]:
map_clusters = folium.Map(location=[start_lat, start_long], zoom_start=11)

# set color scheme for clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Label']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)


In [None]:
map_clusters

Each of the 5 clusters can be evaluated in closer detail to illustrate the differences between them.

In [None]:
toronto_merged.loc[toronto_merged['Cluster Label'] == 0, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1])) ]]

In [None]:
toronto_merged.loc[toronto_merged['Cluster Label'] == 1, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1])) ]]

In [None]:
toronto_merged.loc[toronto_merged['Cluster Label'] == 2, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1])) ]]

In [None]:
toronto_merged.loc[toronto_merged['Cluster Label'] == 3, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1])) ]]

In [None]:
toronto_merged.loc[toronto_merged['Cluster Label'] == 4, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1])) ]]