# Segmenting and Clustering Neighborhoods in Toronto

## Part 1

Importing necessary libraries:

In [105]:
import pandas as pd
import numpy as np

Installing `lxml`, which `pd.read_html` uses to parse the table on Wikipedia:

In [106]:
!pip install lxml



Reading the tables at the URL into `df` (`pd.read_html` returns a list of data frames):

In [107]:
url = 'http://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df = pd.read_html(url)

Assigning the first list element (a pandas data frame) to `df`:

In [108]:
df = df[0]

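Populating Neighbourhood values: where a Neighbourhood is 'Not assigned', it takes the Borough name:

In [109]:
# vectorized over all rows, so no hard-coded row count is needed
mask = df['Neighbourhood'] == 'Not assigned'
df.loc[mask, 'Neighbourhood'] = df.loc[mask, 'Borough']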

Replacing unassigned Borough values with NaN, then dropping those rows:

In [110]:
df.replace(to_replace={'Borough':{'Not assigned':np.nan}},inplace =True)
df.dropna(inplace=True)
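To make the two cleaning steps above concrete, here is a minimal sketch on a made-up three-row frame. The rows are hypothetical stand-ins for the Wikipedia table, and the sketch uses a boolean mask for the Neighbourhood fallback.

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering the three cases handled above
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M7A', 'M3A'],
    'Borough': ['Not assigned', "Queen's Park", 'North York'],
    'Neighbourhood': ['Not assigned', 'Not assigned', 'Parkwoods'],
})

# Unassigned neighbourhood: fall back to the borough name
mask = toy['Neighbourhood'] == 'Not assigned'
toy.loc[mask, 'Neighbourhood'] = toy.loc[mask, 'Borough']

# Unassigned borough: mark as missing, then drop the row
toy['Borough'] = toy['Borough'].replace('Not assigned', np.nan)
toy = toy.dropna(subset=['Borough'])
print(toy)
```

The row with no borough disappears, and the row with a borough but no neighbourhood keeps the borough name in both columns.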

Combining neighbourhoods that share a Postcode into a single comma-separated row, so each Postcode appears once (adapted from the Coursera discussion forum):

In [111]:
df = df.groupby(['Postcode','Borough'], sort = False).agg(lambda x: ','.join(x))
df.reset_index(level=['Postcode','Borough'], inplace=True)
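As a small illustration of what the grouping step produces, the sketch below applies a `groupby`/`agg` join to a hypothetical three-row frame. It selects the `Neighbourhood` column explicitly, which is a slight variation on the cell above, and the rows are made up for the example.

```python
import pandas as pd

# Hypothetical toy rows: two neighbourhoods share postcode M5A
toy = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# One row per (Postcode, Borough); neighbourhood names are comma-joined
toy_grouped = (
    toy.groupby(['Postcode', 'Borough'], sort=False)['Neighbourhood']
       .agg(','.join)
       .reset_index()
)
print(toy_grouped)
```

With `sort=False`, the groups keep the order in which each Postcode first appears.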

Dimensions of the final Data Frame, df, for Part 1:

In [112]:
df.shape

(103, 3)

## Part 2

Defining `postal_code` for separate storage of Postcode column data: 

In [113]:
postal_code = df['Postcode'].values

An attempt to use `geocoder` took too long, so the CSV file of coordinates is used instead:

In [114]:
latlog = pd.read_csv('https://cocl.us/Geospatial_data')

Joining the data frames:

In [115]:
df = df.join(latlog.set_index('Postal Code'), on='Postcode')
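The join above matches the `Postcode` column of `df` against the `Postal Code` column of the CSV. A minimal sketch of the same pattern on two hypothetical two-row frames (the frame names `left` and `coords` are invented for the example):

```python
import pandas as pd

# Small stand-in frames; row order differs on purpose
left = pd.DataFrame({'Postcode': ['M3A', 'M4A'],
                     'Neighbourhood': ['Parkwoods', 'Victoria Village']})
coords = pd.DataFrame({'Postal Code': ['M4A', 'M3A'],
                       'Latitude': [43.725882, 43.753259],
                       'Longitude': [-79.315572, -79.329656]})

# Index the lookup table by its key, then match it against left['Postcode']
joined = left.join(coords.set_index('Postal Code'), on='Postcode')
print(joined)
```

Because the lookup is by key, the row order of the coordinate table does not matter.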

## Part 3

Replicating the lab on NYC:

In [116]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

Forming the map with label markers: 

In [117]:
mp = folium.Map(location=[43.6532, -79.3832], zoom_start=9.5)

# add markers to map
for pc, lat, lng, borough, neighborhood in zip(df['Postcode'], df['Latitude'], df['Longitude'], df['Borough'], df['Neighbourhood']):
    label = '{} | {} | {} '.format(pc, neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(mp)  
    
mp

Defining Foursquare credentials:

In [118]:
CLIENT_ID = 'EBZBKBMHOC0LNVSTB3HQ5HSFKGJGSZJ0X2N2QR4D4YDBTUMI' # your Foursquare ID
CLIENT_SECRET = 'ASKVML0XTCDWA3CVSFHM4YKXHNZWHYZC4GMVWXQNUWRBZUQZ' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 10

Defining a function that returns a data frame of up to `LIMIT` (10) venues for each neighbourhood within a 500 m radius:

In [119]:
import requests  # HTTP client used for the GET request below; not imported earlier in the notebook

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
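        # make the GET request; skip this neighbourhood if the response has no 'groups'
        # entry, which would otherwise raise KeyError: 'groups'
        response = requests.get(url).json().get('response', {})
        if not response.get('groups'):
            continue
        results = response['groups'][0]['items']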
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
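The flattening step inside the function can be tried without calling the API. The sketch below is only an illustration: `extract_venues` is a hypothetical helper, and `sample` is a hand-written payload that mimics just the fields the function reads, not real Foursquare output.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore-style payload into rows; return [] if 'groups' is absent."""
    groups = payload.get('response', {}).get('groups') or []
    if not groups:
        return []
    rows = []
    for item in groups[0]['items']:
        venue = item['venue']
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     venue['categories'][0]['name']))
    return rows

# Hand-written stand-in payload (hypothetical venue, not real API output)
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': 43.75, 'lng': -79.33},
               'categories': [{'name': 'Coffee Shop'}]}}]}]}}

print(extract_venues(sample, 'Parkwoods', 43.753259, -79.329656))
print(extract_venues({'response': {}}, 'Parkwoods', 43.753259, -79.329656))  # no 'groups' key
```

Returning an empty list when `groups` is missing lets a caller skip a neighbourhood instead of stopping on a `KeyError`.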

Using the defined function:

In [120]:
toronto_venues = getNearbyVenues(names=df['Neighbourhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

Parkwoods
Victoria Village
Harbourfront,Regent Park
Lawrence Heights,Lawrence Manor
Queen's Park
Islington Avenue
Rouge,Malvern
Don Mills North
Woodbine Gardens,Parkview Hill
Ryerson,Garden District
Glencairn
Cloverdale,Islington,Martin Grove,Princess Gardens,West Deane Park
Highland Creek,Rouge Hill,Port Union
Flemingdon Park,Don Mills South
Woodbine Heights


KeyError: 'groups'

One-hot encoding the venue categories:

In [121]:
# one hot encoding
oh = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
oh['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [oh.columns[-1]] + list(oh.columns[:-1])
oh = oh[fixed_columns]

Grouping the one-hot encoded data frame by neighbourhood and taking the mean, which gives the frequency of each venue category:

In [122]:
gp = oh.groupby('Neighborhood').mean().reset_index()

| | Neighborhood | Yoga Studio | Accessories Store | Airport | Airport Food Court | Airport Gate | Airport Lounge | Airport Terminal | American Restaurant | Arts & Crafts Store | ... | Theme Restaurant | Toy / Game Store | Trail | Vegetarian / Vegan Restaurant | Video Store | Vietnamese Restaurant | Warehouse Store | Wine Bar | Wings Joint | Women's Store |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adelaide,King,Richmond | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Agincourt | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | Agincourt North,L'Amoreaux East,Milliken,Steel... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | Albion Gardens,Beaumond Heights,Humbergate,Jam... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Alderwood,Long Branch | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | Willowdale West | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 96 | Woburn | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 97 | Woodbine Gardens,Parkview Hill | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 98 | Woodbine Heights | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
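The encoding and grouping steps above can be seen end to end on a tiny example. This is a sketch with hypothetical neighbourhoods `A` and `B` and made-up venues, meant only to show why the grouped values read as category frequencies.

```python
import pandas as pd

# Hypothetical venues for two neighbourhoods
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'B'],
    'Venue Category': ['Park', 'Coffee Shop', 'Park'],
})

# One indicator column per category, with no prefix on the column names
onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")
onehot['Neighborhood'] = venues['Neighborhood']

# Averaging the indicators per neighbourhood gives category frequencies
freq = onehot.groupby('Neighborhood').mean().reset_index()
print(freq)
```

Neighbourhood `A` has one park among two venues, so its `Park` value is 0.5, while `B` has only a park, so its `Park` value is 1.0.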


Defining function for most popular venues, **adapted from the Lab, as is most of the complex coding you see in this notebook**:

In [123]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # past 'rd', fall back to the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = gp['Neighborhood']

for ind in np.arange(gp.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(gp.iloc[ind, :], num_top_venues)
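The core of `return_most_common_venues` is a sort on a single row. A minimal sketch on one hypothetical row, laid out like a row of `gp` (name first, then category frequencies):

```python
import pandas as pd

# Hypothetical frequency row: neighbourhood name followed by three categories
row = pd.Series({'Neighborhood': 'A', 'Park': 0.2, 'Coffee Shop': 0.5, 'Pub': 0.3})

# Drop the name, sort frequencies in descending order, keep the top two labels
top2 = row.iloc[1:].sort_values(ascending=False).index.values[0:2]
print(list(top2))
```

The function returns category names rather than frequencies, which is why the resulting columns hold labels such as venue types.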

K-Means Clustering Algorithm to Cluster Neighborhoods:

In [124]:
# set number of clusters
kclusters = 5

cl = gp.drop(columns='Neighborhood')  # keyword form; the positional axis argument was removed in pandas 2.0

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(cl)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 2, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
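For intuition about what the labels mean, here is a small sketch of k-means on four hypothetical two-category frequency vectors. The data is invented, and `n_init` is set explicitly as an assumption to keep the behaviour the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical frequency vectors: two park-heavy rows, two cafe-heavy rows
X = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.1, 0.9],
              [0.2, 0.8]])

# n_init is given explicitly because its default differs across scikit-learn versions
km = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
print(km.labels_)  # label numbers are arbitrary; only the grouping matters
```

Rows with similar category frequencies share a label, which is the sense in which neighbourhoods in the same cluster are alike.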

Merging the results with the original data frame, `df`:

In [135]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

merge = df

# join the sorted venues onto df to add latitude/longitude for each neighbourhood
merge = merge.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood', how='right')

merge.head() # check the last columns!

| | Postcode | Borough | Neighbourhood | Latitude | Longitude | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 | 2 | Park | Food & Drink Shop | Women's Store | Diner | Event Space | Empanada Restaurant | Electronics Store | Eastern European Restaurant | Drugstore | Dog Run |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 | 0 | Grocery Store | Coffee Shop | Portuguese Restaurant | Hockey Arena | Dessert Shop | Event Space | Empanada Restaurant | Electronics Store | Eastern European Restaurant | Drugstore |
| 2 | M5A | Downtown Toronto | Harbourfront,Regent Park | 43.65426 | -79.360636 | 0 | Breakfast Spot | Historic Site | Park | Coffee Shop | Pub | Restaurant | Spa | Bakery | Gym / Fitness Center | Furniture / Home Store |
| 3 | M6A | North York | Lawrence Heights,Lawrence Manor | 43.718518 | -79.464763 | 0 | Clothing Store | Boutique | Coffee Shop | Event Space | Furniture / Home Store | Vietnamese Restaurant | Accessories Store | Cuban Restaurant | Discount Store | Fabric Shop |
| 4 | M7A | Queen's Park | Queen's Park | 43.662301 | -79.389494 | 0 | Coffee Shop | Yoga Studio | Italian Restaurant | Mexican Restaurant | Portuguese Restaurant | Park | Creperie | Gym | Cuban Restaurant | Empanada Restaurant |


Visualizing the Clusters on map:

In [137]:
# create map
cmap = folium.Map(location=[43.6532, -79.3832], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(merge['Latitude'], merge['Longitude'], merge['Neighbourhood'], merge['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(cmap)
       
cmap
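The colour scheme in the map cell comes down to sampling one colour per cluster from a colormap. A minimal sketch of that step on its own, where `palette` is a name introduced here for illustration:

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5  # same value as in the clustering cell

# Sample kclusters evenly spaced colours from the rainbow colormap as hex strings
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]
print(palette)
```

Note that the map cell indexes the list with `cluster-1`, so Cluster 0 is drawn with the last colour in the list.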

## Observations

1. Most neighborhoods fall in **Cluster 0**.
2. The **Cluster 0** neighborhoods are densest near the Lake Ontario shoreline.