
<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto City</font></h1>

## Introduction

1. In this lab, you will learn how to convert addresses into their equivalent latitude and longitude values.

2. You will use the Foursquare API to explore neighborhoods in Toronto City and get the location of each neighborhood.

3. You will then __use this location feature to group the neighborhoods into clusters__ with the *k*-means clustering algorithm.

4. Finally, you will use the Folium library to visualize the neighborhoods in Toronto City and their emerging clusters.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Download and Explore Dataset</a>

2. <a href="#item2">Explore Neighborhoods in Toronto City</a>

3. <a href="#item3">Analyze Each Neighborhood</a>

4. <a href="#item4">Cluster Neighborhoods</a>

5. <a href="#item5">Examine Clusters</a>    
</font>
</div>

Before we get the data and start exploring it, let's download all the dependencies that we will need.

In [1]:
import numpy as np      # library to handle data in a vectorized manner

import pandas as pd     # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json             # library to handle JSON files

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans



#!conda install -c conda-forge geopy --yes           # uncomment to run this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim               # convert an address into latitude and longitude values

#!conda install -c conda-forge folium=0.5.0 --yes    # uncomment this line if you haven't completed the Foursquare API lab
import folium                                       # map rendering library


import requests                                     # library to handle requests
#from pandas.io.json import json_normalize           # transform JSON file into a pandas dataframe


print('Libraries imported.')

Libraries imported.


<a id='item1'></a>

## 1. Download and Explore Dataset

In [6]:
import pandas as pd
#!pip3 install lxml
#!conda install -c conda-forge lxml --yes

from lxml import etree

In [7]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df = pd.read_html(url)
df=df[0]               # The first table (index 0) in the df is the table we want to retrieve

print(df.shape)
df.head()

(180, 3)


|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


In [8]:
df=df.loc[df['Borough']!='Not assigned']
df=df.reset_index(drop=True)

print(df.shape)
df.head()

(103, 3)


|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


## Use geopy library to get the latitude and longitude values of Toronto City.

In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>ny_explorer</em>, as shown below.
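A geocoder lookup can come back empty: geopy's `geocode` returns `None` when it finds no match, and reading `.latitude` from that result raises an error. The sketch below wraps the lookup in a small guard. The helper name `lookup_coordinates` is hypothetical, and the `geocode` argument is assumed to be any callable that behaves like `Nominatim(...).geocode`.

```python
from typing import Callable, Optional, Tuple


def lookup_coordinates(address: str, geocode: Callable) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) for an address, or None if nothing matched.

    `geocode` is assumed to take an address string and return either None
    or an object exposing `.latitude` and `.longitude` attributes.
    """
    location = geocode(address)
    if location is None:
        # No match: let the caller decide how to handle it
        return None
    return (location.latitude, location.longitude)
```

With geopy installed, a call might look like `lookup_coordinates('Toronto City, TOR', Nominatim(user_agent="ny_explorer").geocode)`.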

In [9]:
address = 'Toronto City, TOR'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)

latitude = location.latitude
longitude = location.longitude

print('The geographical coordinates of Toronto City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto City are 43.7370584, -79.2442535.


### Download the file containing location coordinates of each postal code

In [10]:
location_df=pd.read_csv('http://cocl.us/Geospatial_data')
print(location_df.shape)
location_df.head()

(103, 3)


|   | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


### Create columns 'Latitude' and 'Longitude' in the original dataframe df, using information from the location_df file just downloaded

In [14]:
import pandas as pd
import numpy as np

# Attach Latitude and Longitude to df with a left join on 'Postal Code'.
# A left join keeps every row of df, in its original order, and avoids
# assigning through chained indexing such as df['Latitude'][row] = value.
df = df.merge(location_df, on='Postal Code', how='left')

df = df.rename(columns={'Neighbourhood': 'Neighborhood'})
print(df.dtypes)
print(df.shape)
df.head(4)

Postal Code      object
Borough          object
Neighborhood     object
Latitude        float64
Longitude       float64
dtype: object
(103, 5)


|   | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
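A join on `Postal Code` is one way to attach coordinates and, at the same time, check for postal codes that have no match. The sketch below uses three hand-made rows; `M9Z` is a made-up code included only to show an unmatched key. The `validate` and `indicator` arguments are standard `DataFrame.merge` options.

```python
import pandas as pd

# Two rows taken from the tables above, plus one hypothetical unmatched code
boroughs = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Unknown'],
})
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

merged = boroughs.merge(
    coords,
    on='Postal Code',
    how='left',              # keep every row of the left table
    validate='many_to_one',  # fail if a postal code appears twice in coords
    indicator=True,          # add a '_merge' column describing each row's match
)

# Postal codes that received no coordinates
missing = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()
print(missing)
```

Rows listed in `missing` end up with `NaN` coordinates, which would break the map markers later, so it is worth checking that the list is empty.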


## Create a map of Toronto with postal code locations superimposed on top.

In [15]:
# create map of Toronto using latitude and longitude values
map_tor = folium.Map(location=[latitude, longitude], zoom_start=10.5)

# add markers to map
# one marker per row: latitude, longitude, borough and neighborhood
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):    
   
    label = '{}, {}'.format(neighborhood, borough)   # e.g. label = 'Parkwoods, North York'
    label = folium.Popup(label, parse_html=True)
    
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_tor)  
    
map_tor

<a id='item2'></a>

## 2. Explore Neighborhoods in Toronto City

In [16]:
import numpy as np
import pandas as pd
import requests    

CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (keep real credentials out of shared notebooks)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT=106

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
    


def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
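        # build the API request; requests URL-encodes the query parameters
        url = 'https://api.foursquare.com/v2/venues/explore'
        params = {
            'client_id': CLIENT_ID,
            'client_secret': CLIENT_SECRET,
            'v': VERSION,
            'll': '{},{}'.format(lat, lng),
            'radius': radius,
            'limit': LIMIT,
        }
            
        # make the GET request and stop on an HTTP error instead of parsing a bad reply
        response = requests.get(url, params=params, timeout=30)
        response.raise_for_status()
        results = response.json()["response"]['groups'][0]['items']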
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postal Code', 
                  'Latitude', 
                  'Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
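The list comprehension above indexes `categories[0]` directly, so a venue with an empty category list, or a reply without any groups, would raise an error. The sketch below parses the same nested layout the function relies on (`response` → `groups[0]` → `items` → `venue`) more defensively. The helper name `extract_venues` and the sample payload are illustrative assumptions; the payload is hand-made, not real API output.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style payload.

    Venues without categories get None as their category, and a payload
    without groups yields an empty list.
    """
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []

    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        category = categories[0]['name'] if categories else None
        rows.append((
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category,
        ))
    return rows


# Hand-made payload shaped like the structure used in getNearbyVenues
sample = {
    'response': {
        'groups': [{
            'items': [
                {'venue': {'name': 'Example Park',
                           'location': {'lat': 43.75, 'lng': -79.33},
                           'categories': [{'name': 'Park'}]}},
                {'venue': {'name': 'Unlabelled Spot',
                           'location': {'lat': 43.76, 'lng': -79.32},
                           'categories': []}},
            ]
        }]
    }
}

print(extract_venues(sample))
```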

#### Now write the code to run the above function and create a new dataframe called *toronto_venues*.

In [None]:
import requests    
toronto_venues = getNearbyVenues(names=df['Postal Code'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

#### Let's check the size of the resulting dataframe

In [None]:
print(toronto_venues.shape)
toronto_venues.head()

#### Let's find out how many unique categories can be curated from all the returned venues

In [None]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

<a id='item3'></a>

## 3. Analyze Each Neighborhood

In [None]:
# one-hot encode the Venue Category column (one column per category)
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add the postal code column back to the new dataframe
toronto_onehot['Postal Code'] = toronto_venues['Postal Code'] 

# move the postal code column to the first position
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])

toronto_onehot = toronto_onehot[fixed_columns]

print(toronto_onehot.shape)
toronto_onehot.head()

The shape printed above is the size of the new dataframe.

#### Next, let's group rows by postal code and take the mean of the frequency of occurrence of each category

In [None]:
toronto_grouped = toronto_onehot.groupby('Postal Code').mean().reset_index()
print(toronto_grouped.shape)
toronto_grouped.head()
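To make the one-hot and group-by steps concrete, here is a minimal sketch on four made-up venues. Averaging the 0/1 indicator columns within each postal code gives the share of that postal code's venues that fall in each category. The venue rows are invented for illustration, and `dtype=float` is passed so the indicators are numeric rather than boolean.

```python
import pandas as pd

# Four invented venues across two postal codes
venues = pd.DataFrame({
    'Postal Code': ['M3A', 'M3A', 'M3A', 'M4A'],
    'Venue Category': ['Park', 'Park', 'Coffee Shop', 'Pizza Place'],
})

# One indicator column per category
onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='', dtype=float)
onehot.insert(0, 'Postal Code', venues['Postal Code'])

# Mean of the indicators = relative frequency of each category per postal code
grouped = onehot.groupby('Postal Code').mean().reset_index()
print(grouped)
```

Here M3A has three venues, two of them parks, so its `Park` frequency is 2/3, while M4A has a single pizza place and a `Pizza Place` frequency of 1.0.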

#### Let's print each postal code along with the top 10 most common venues

In [None]:
num_top_venues = 10

# toronto_grouped is keyed by 'Postal Code', so iterate over that column
for hood in toronto_grouped['Postal Code']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Postal Code'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

### Let's put that into a *pandas* dataframe

First, let's write a function to sort the venues in descending order.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
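As a quick check of the sorting idea, the sketch below applies the same steps to one hand-made row: skip the first entry (the key), sort the remaining frequencies in descending order, and keep the leading labels. The function name `top_categories` and the frequencies are invented for illustration.

```python
import pandas as pd

# One invented row: a key followed by category frequencies
row = pd.Series({
    'Postal Code': 'M3A',
    'Coffee Shop': 0.25,
    'Park': 0.50,
    'Pizza Place': 0.10,
    'Bakery': 0.15,
})


def top_categories(row, n):
    """Return the n category names with the highest frequency in the row."""
    freqs = row.iloc[1:].astype(float)   # drop the key, keep the frequencies
    return freqs.sort_values(ascending=False).index[:n].tolist()


print(top_categories(row, 2))
```

When several categories share the same frequency, their relative order after sorting is not guaranteed, so tied ranks should not be read as meaningful.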

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Postal Code']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe keyed by postal code, matching toronto_grouped
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Postal Code'] = toronto_grouped['Postal Code']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)
    
print(neighborhoods_venues_sorted.shape)    
neighborhoods_venues_sorted.head()

<a id='item4'></a>

## 4. Cluster Neighborhoods

Run *k*-means to cluster the neighborhoods into 5 clusters.

In [None]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Postal Code')   # keep only the category-frequency columns

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels (we have 5 cluster labels here 0,1,2,3,4) generated for each row (neighborhood) in the dataframe
kmeans.labels_ 
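For intuition about what the clustering step does, the sketch below implements the basic *k*-means loop (Lloyd's algorithm) with NumPy: assign each point to its nearest center, move each center to the mean of its points, and repeat until the centers stop moving. This is a simplified illustration, not scikit-learn's implementation; `KMeans` uses k-means++ initialization by default, whereas this sketch starts from randomly chosen data points. The function names and toy points are assumptions made for the example.

```python
import numpy as np


def assign_labels(points, centers):
    """Index of the nearest center for every point (Euclidean distance)."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)


def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """Basic k-means: alternate assignment and center updates until stable."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]

    for _ in range(n_iter):
        labels = assign_labels(points, centers)
        # Move each center to the mean of its points; keep it if it has none
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    return assign_labels(points, centers), centers


# Two well-separated toy groups of points
toy_points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
toy_labels, toy_centers = lloyd_kmeans(toy_points, k=2)
print(toy_labels)
```

The label numbers themselves are arbitrary; only the grouping matters, which is why cluster 0 in one run may correspond to cluster 1 in another.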

### Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [None]:
# add clustering labels as the first column
# (DataFrame.insert modifies the dataframe in place and returns None)
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

print(neighborhoods_venues_sorted.shape)
neighborhoods_venues_sorted.head()

In [None]:
print(df.shape)
df.head(5)

In [None]:
# work on a copy so the sorted-venues dataframe stays unchanged
df2 = neighborhoods_venues_sorted.copy()

df2.head(2)

In [None]:
# Add Neighborhood, Latitude and Longitude to df2 by joining on 'Postal Code',
# the key shared by df and the dataframes built from toronto_grouped.
df2 = df2.merge(
    df[['Postal Code', 'Neighborhood', 'Latitude', 'Longitude']],
    on='Postal Code',
    how='left'
)

print(df2.shape)
df2.head(30)



In [None]:
# toronto_merged and df5 are both names for df2, used by the map cells below
toronto_merged = df2
df5 = df2

print(toronto_merged.shape)
toronto_merged.head()

Finally, let's visualize the resulting clusters.

In [None]:
# create map of Toronto using latitude and longitude values
map_tor = folium.Map(location=[latitude, longitude], zoom_start=10.5)

# add markers to map
# one marker per neighborhood (the cluster label is unpacked but not used for coloring here)
for lat, lng, cluster, neighborhood in zip(df5['Latitude'], df5['Longitude'], df5['Cluster Labels'], df5['Neighborhood']):    
   
    label = '{}'.format(neighborhood)   # e.g. label = 'Parkwoods'
    label = folium.Popup(label, parse_html=True)
    
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_tor)  
    
map_tor

In [None]:
# create map of Toronto using latitude and longitude values
map_tor = folium.Map(location=[latitude, longitude], zoom_start=10.5)

# add markers to map
# one marker per neighborhood: latitude, longitude and name
for lat, lng, neighborhood in zip(df5['Latitude'], df5['Longitude'], df5['Neighborhood']):    
   
    label = '{}'.format(neighborhood)   # e.g. label = 'Parkwoods'
    label = folium.Popup(label, parse_html=True)
    
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_tor)  
    
map_tor

In [None]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)


# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]


# add markers to the map
markers_colors = []

for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    co=rainbow[cluster-1]   # color assigned to this cluster
    
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=co,
        fill=True,
        fill_color=co,
        fill_opacity=0.7,parse_html=False).add_to(map_clusters)
    #print(cluster)
    
map_clusters

In [18]:
# set number of clusters
kcluster = 5

import copy

dfnow = copy.copy(df)
dfnow = dfnow[['Latitude','Longitude']]

# run k-means clustering
kmean = KMeans(n_clusters=kcluster, random_state=0).fit(dfnow)

# check cluster labels (we have 5 cluster labels here 0,1,2,3,4) generated for each row (neighborhood) in the dataframe
kmean.labels_ 

array([4, 4, 2, 3, 2, 1, 0, 4, 4, 2, 3, 1, 0, 4, 4, 2, 2, 1, 0, 4, 2, 2,
       0, 4, 2, 2, 0, 3, 3, 4, 2, 2, 0, 3, 3, 4, 2, 2, 0, 3, 3, 4, 2, 2,
       4, 3, 1, 4, 2, 1, 1, 0, 3, 1, 4, 3, 1, 1, 4, 3, 1, 3, 3, 1, 1, 0,
       3, 3, 2, 1, 1, 0, 3, 3, 2, 2, 1, 1, 0, 2, 2, 1, 0, 2, 2, 0, 2, 2,
       1, 1, 0, 2, 2, 1, 1, 0, 2, 2, 1, 2, 4, 1, 1], dtype=int32)

In [19]:
# add clustering labels
dfmap = copy.copy(df)
dfmap.insert(0, 'Cluster Labels', kmean.labels_)
dfmap.head()

|   | Cluster Labels | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 0 | 4 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | 4 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | 2 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
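One way to examine the clusters is to summarize each label: how many postal codes it holds and where its members sit on average. The sketch below does this for the five rows shown above only, so the numbers describe that sample and not the full dataset. The output column names (`postal_codes`, `mean_latitude`, `mean_longitude`) are illustrative choices.

```python
import pandas as pd

# The five rows displayed above
sample = pd.DataFrame({
    'Cluster Labels': [4, 4, 2, 3, 2],
    'Postal Code': ['M3A', 'M4A', 'M5A', 'M6A', 'M7A'],
    'Latitude': [43.753259, 43.725882, 43.65426, 43.718518, 43.662301],
    'Longitude': [-79.329656, -79.315572, -79.360636, -79.464763, -79.389494],
})

# Size and mean position of each cluster
summary = sample.groupby('Cluster Labels').agg(
    postal_codes=('Postal Code', 'count'),
    mean_latitude=('Latitude', 'mean'),
    mean_longitude=('Longitude', 'mean'),
)
print(summary)
```

Running the same aggregation on the full `dfmap` would give one row per cluster, which can be compared against the map to see which part of the city each label covers.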


In [21]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)


# set color scheme for the clusters
x = np.arange(kcluster)
ys = [i + x + (i*x)**2 for i in range(kcluster)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]


# add markers to the map
markers_colors = []

for lat, lon, poi,cluster in zip(dfmap['Latitude'], dfmap['Longitude'], dfmap['Postal Code'],dfmap['Cluster Labels']):

    label = folium.Popup(str(poi), parse_html=True)
    co=rainbow[cluster-1]
    
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=co,
        fill=True,
        fill_color=co,
        fill_opacity=0.7,parse_html=False).add_to(map_clusters)
    #print(cluster)
    
map_clusters
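The map cells above take their marker colors from matplotlib's rainbow colormap. As an alternative sketch that needs only the standard library, evenly spaced hues can be converted to hex strings with `colorsys`. The function name `cluster_palette` and the saturation and brightness values are arbitrary choices for illustration.

```python
import colorsys


def cluster_palette(n_clusters, saturation=0.8, value=0.9):
    """Return one hex color per cluster, spaced evenly around the hue wheel."""
    palette = []
    for i in range(n_clusters):
        r, g, b = colorsys.hsv_to_rgb(i / n_clusters, saturation, value)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette


palette = cluster_palette(5)
print(palette)
```

Indexing the list directly with the cluster label, as in `palette[cluster]`, gives each of the labels 0 to 4 its own color.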