# Segmenting and Clustering Neighborhoods in Toronto

This notebook is for week 3 of the Capstone project.
In this assignment, we will explore, segment, and cluster the neighborhoods in the city of Toronto. We will collect the data from Foursquare and put it into a structured format. Once the data is structured, we will replicate the analysis that we did on the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

Before we download the data and start exploring it, let's install and import the dependencies needed by this notebook.

In [4]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium 0.5.0 is already installed
import folium # map rendering library

print('Libraries imported.')

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    branca-0.3.1               |             py_0          25 KB  conda-forge
    ca-certificates-2019.11.28 |       hecc5488_0         145 KB  conda-forge
    altair-4.0.1               |             py_0         575 KB  conda-forge
    openssl-1.1.1d             |       h516909a_0         2.1 MB  conda-forge
    folium-0.5.0               |             py_0          45 KB  conda-forge
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    certifi-2019.11.28         |           py36_0         149 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         3.0 MB

The following NEW packages will be 

## 1. Download and Explore Dataset

First, we need a dataset file that contains the Toronto neighbourhoods (Toronto_Neighbourhood_Coordinates.csv) with the following fields: Postal Code (Postcode), Borough, Neighbourhood, Latitude, Longitude.
If we run this notebook for the first time, that file will not exist yet, and the notebook will create it with these steps:
    - Get the list of postal codes in Toronto from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M . We copy the list manually, save it as a csv file (Toronto.csv), and upload it as an asset so this notebook can access it. This file contains the following attributes: Postal Code (Postcode), Borough, Neighbourhood.
    - Load the Toronto neighbourhood data from Toronto.csv into a dataframe (df_Toronto).
    - Loop over each neighbourhood in df_Toronto and get its latitude and longitude using geopy.geocoders. After the loop finishes, the dataframe will contain the following attributes: Postal Code (Postcode), Borough, Neighbourhood, latitude, longitude.
    - Save the dataframe to csv file (Toronto_Neighbourhood_Coordinate
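
The first-run versus later-run logic described in the steps above can be sketched as follows. This is only an illustration under stated assumptions: `load_or_build_coordinates` and the `geocode` callable are hypothetical names that do not appear in this notebook, and the sketch works with a local file path rather than with IBM Cloud Object Storage.

```python
import os
import pandas as pd


def load_or_build_coordinates(source_csv, cache_csv, geocode):
    """Return the neighbourhood dataframe with latitude/longitude columns.

    geocode is a hypothetical callable that takes a query string and returns
    a (latitude, longitude) tuple, or None when nothing is found.
    """
    # Later runs: the cached file already exists, so just read it back.
    if os.path.exists(cache_csv):
        return pd.read_csv(cache_csv)

    # First run: geocode every neighbourhood, then save the result.
    df = pd.read_csv(source_csv)
    df["latitude"] = 0.0
    df["longitude"] = 0.0
    for index, row in df.iterrows():
        result = geocode("{}, Toronto, Ontario".format(row["Neighbourhood"]))
        if result is not None:
            df.at[index, "latitude"] = result[0]
            df.at[index, "longitude"] = result[1]
    df.to_csv(cache_csv, index=False)
    return df
```

Passing the geocoder in as a callable keeps the caching logic independent of any particular service, so it can be exercised with a stub before calling a real geocoder.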
    

In [None]:
isTorontoFileExists = True

In [5]:

import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
client_c8e4850bd8dc439d8772288cbd371cd8 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='**************',  # credential redacted; supply your own API key
    ibm_auth_endpoint="https://iam.ng.bluemix.net/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

bucket_id = 'phytonbasicfords-donotdelete-pr-ksk41itkqmstbb'

body = client_c8e4850bd8dc439d8772288cbd371cd8.get_object(Bucket=bucket_id,Key='Toronto.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )
    
    
df_Toronto = pd.read_csv(body)
df_Toronto.head()


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M5R,Central Toronto,Yorkville
1,M3J,North York,York University
2,M2P,North York,York Mills West
3,M2L,North York,York Mills
4,M4C,East York,Woodbine Heights


Now that we have a dataframe of Toronto's neighbourhoods, we retrieve a dataframe of Toronto's geospatial coordinates keyed by postal code. The data is read from Toronto_Geospatial_Coordinates.csv.

In [6]:

body = client_c8e4850bd8dc439d8772288cbd371cd8.get_object(Bucket=bucket_id,Key='Toronto_Geospatial_Coordinates.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_geospatial = pd.read_csv(body)

# rename the postal code column because it will serve as the join key with the Toronto dataframe
df_geospatial.rename(columns={"Postal Code": "Postcode"}, inplace = True )

df_geospatial.head()


Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


We also need a dataframe with the geospatial location of each Toronto neighbourhood. When the notebook is executed for the first time, it needs to call the geocoder to get the latitude and longitude of each neighbourhood and save the result to a csv file (Toronto_Neighbourhood_Coordinates.csv). On the second and subsequent runs, we just need to read that csv file and load it into a dataframe.

In [7]:
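# df_postalcode is not defined in the cells shown; assuming it is a working copy of df_Toronto
df_postalcode = df_Toronto.copy()

# add latitude and longitude columns, initialised to 0.0
df_postalcode['latitude'] = 0.0
df_postalcode['longitude'] = 0.0
df_postalcode.head()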

Unnamed: 0,Postcode,Borough,Neighbourhood,latitude,longitude
0,M5R,Central Toronto,Yorkville,0.0,0.0
1,M3J,North York,York University,0.0,0.0
2,M2P,North York,York Mills West,0.0,0.0
3,M2L,North York,York Mills,0.0,0.0
4,M4C,East York,Woodbine Heights,0.0,0.0


In [9]:
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="coursera_exam")

# Geocode each neighbourhood by name; rows that cannot be resolved keep 0.0
for index, row in df_postalcode.iterrows():
    location = geolocator.geocode("{}, Toronto, Ontario".format(row['Neighbourhood']))

    if location is not None:
        df_postalcode.at[index, 'latitude'] = location.latitude
        df_postalcode.at[index, 'longitude'] = location.longitude
    # print progress so a long run can be monitored
    print("{},{},{},{},{},{}".format(index, row['Postcode'], row['Borough'], row['Neighbourhood'], df_postalcode.at[index, 'latitude'], df_postalcode.at[index, 'longitude']))

print('finish')

0,M5R,Central Toronto,Yorkville,43.6713861,-79.3901677
1,M3J,North York,York University,43.7792419,-79.4835593
2,M2P,North York,York Mills West,43.7440391,-79.406657
3,M2L,North York,York Mills,43.7440391,-79.406657
4,M4C,East York,Woodbine Heights,43.6999302,-79.3191316
5,M4B,East York,Woodbine Gardens,43.7120785,-79.3025673
6,M1G,Scarborough,Woburn,43.7598243,-79.2252908
7,M3H,North York,Wilson Heights,43.7405195,-79.4400172
8,M2R,North York,Willowdale West,43.7615095,-79.4109234
9,M2N,North York,Willowdale South,43.7615095,-79.4109234
10,M2M,North York,Willowdale,43.7615095,-79.4109234
11,M1P,Scarborough,Wexford Heights,43.7432421,-79.304641
12,M1R,Scarborough,Wexford,43.7453767,-79.2947155
13,M9N,York,Weston,43.7001608,-79.5162474
14,M9P,Etobicoke,Westmount,43.6936399,-79.5210426
15,M1E,Scarborough,West Hill,43.7689144,-79.1872905
16,M9B,Etobicoke,West Deane Park,43.6631995,-79.5685684
17,M4A,North York,Victoria Village,43.732658,-79.3111892
18,M5L,Downtown Toronto,Victoria Hotel,4

165,M3J,North York,Northwood Park,43.7541351,-79.50448
166,M9W,Etobicoke,Northwest,43.6465466,-79.4195263
167,M4R,Central Toronto,North Toronto West,43.6465466,-79.4195263
168,M6L,North York,North Park,43.7186899,-79.4775337
169,M5R,Central Toronto,North Midtown,43.7051999,-79.39741552147335
170,M2M,North York,Newtonbrook,43.7938863,-79.42567902301055
171,M8V,Etobicoke,New Toronto,43.6007625,-79.505264
172,M9V,Etobicoke,Mount Olive,43.653963,-79.387207
173,M6M,York,Mount Dennis,43.6869597,-79.4895513
174,M1E,Scarborough,Morningside,43.7826012,-79.2049579
175,M4T,Central Toronto,Moore Park,43.6903876,-79.3832965
176,M8X,Etobicoke,Montgomery Road,43.6490313,-79.5188556
177,M8V,Etobicoke,Mimico South,43.6166773,-79.4968048
178,M8Z,Etobicoke,Mimico NW,43.6166773,-79.4968048
179,M8Y,Etobicoke,Mimico NE,43.6166773,-79.4968048
180,M1V,Scarborough,Milliken,43.8231743,-79.3017626
181,M1R,Scarborough,Maryvale,43.7590508,-79.3102297
182,M9R,Etobicoke,Martin Grove Gardens,43.6873105,-79.561967
183
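
Several rows in the output above share exactly the same coordinates (for example York Mills West and York Mills, or the three Willowdale rows), and any neighbourhood that the geocoder cannot resolve keeps the initial 0.0 values. A minimal sketch for flagging such rows for manual review follows. The function name `flag_suspicious_coordinates` is illustrative and not part of this notebook, and the last sample row is a made-up unresolved entry; the other three rows are taken from the output above.

```python
import pandas as pd


def flag_suspicious_coordinates(df):
    """Return rows that were not geocoded or that share coordinates with another row."""
    # Rows still holding the 0.0 placeholders were never resolved.
    not_geocoded = (df["latitude"] == 0.0) & (df["longitude"] == 0.0)
    # keep=False marks every member of a duplicated coordinate pair.
    shared = df.duplicated(subset=["latitude", "longitude"], keep=False)
    return df[not_geocoded | shared]


sample = pd.DataFrame({
    "Neighbourhood": ["Yorkville", "York Mills West", "York Mills", "Unresolved example"],
    "latitude": [43.6713861, 43.7440391, 43.7440391, 0.0],
    "longitude": [-79.3901677, -79.406657, -79.406657, 0.0],
})
print(flag_suspicious_coordinates(sample)["Neighbourhood"].tolist())
```

Shared coordinates are not necessarily wrong, since adjacent neighbourhoods may resolve to the same place, so the flagged rows are candidates for inspection rather than confirmed errors.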

In [None]:
outputFile = "Toronto_Neighbourhood_Coordinates.csv"


In [10]:

# This approach did not work, so it is commented out
#df_postalcode.to_csv(outputFile, encoding="utf-8")
#with open(outputFile, 'rb') as data:
#     client_c8e4850bd8dc439d8772288cbd371cd8.upload_fileobj(data,  bucket_id, outputFile)
        
#print('finish')

finish


In [11]:
# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='539dfb05-329b-43a1-899d-d02c237b0ca3', project_access_token='**************')  # access token redacted
pc = project.project_context

project.save_data(data=df_postalcode.to_csv(index=False),file_name=outputFile,overwrite=True)

print('finish')

finish


In [None]:

from project_lib import Project
project = Project(None, '**************', '**************')
pc = project.project_context

Now that we have the dataframe of Toronto geospatial coordinates, we merge the Toronto postal code dataframe with it, so that the new dataframe contains the Toronto postal code list and its location.

In [15]:
df_postalcode

Unnamed: 0,Postcode,Borough,Neighbourhood,latitude,longitude
0,M5R,Central Toronto,Yorkville,43.671386,-79.390168
1,M3J,North York,York University,43.779242,-79.483559
2,M2P,North York,York Mills West,43.744039,-79.406657
3,M2L,North York,York Mills,43.744039,-79.406657
4,M4C,East York,Woodbine Heights,43.69993,-79.319132
5,M4B,East York,Woodbine Gardens,43.712078,-79.302567
6,M1G,Scarborough,Woburn,43.759824,-79.225291
7,M3H,North York,Wilson Heights,43.740519,-79.440017
8,M2R,North York,Willowdale West,43.76151,-79.410923
9,M2N,North York,Willowdale South,43.76151,-79.410923


In [8]:
df_TorontoPostalCodeGeospatial = pd.merge( df_postalcode, df_geospatial,  on='Postcode')
df_TorontoPostalCodeGeospatial.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M5R,Central Toronto,Yorkville,43.67271,-79.405678
1,M5R,Central Toronto,The Annex,43.67271,-79.405678
2,M5R,Central Toronto,North Midtown,43.67271,-79.405678
3,M3J,North York,York University,43.76798,-79.487262
4,M3J,North York,Northwood Park,43.76798,-79.487262
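
Because several neighbourhoods share one postal code, this merge is many-to-one: each neighbourhood row should match exactly one row of the coordinates table. A hedged sketch of how that expectation could be checked with the `validate` and `indicator` parameters of `pd.merge` is shown below. The two small dataframes are illustrative samples, M9Z is a made-up postal code used only to demonstrate an unmatched key, and the sketch uses `how="left"` instead of the default inner join used in the cell above so that unmatched rows stay visible.

```python
import pandas as pd

neighbourhoods = pd.DataFrame({
    "Postcode": ["M5R", "M5R", "M3J", "M9Z"],  # M9Z is a made-up code with no coordinates
    "Neighbourhood": ["Yorkville", "The Annex", "York University", "Hypothetical"],
})
coordinates = pd.DataFrame({
    "Postcode": ["M5R", "M3J"],
    "Latitude": [43.67271, 43.76798],
    "Longitude": [-79.405678, -79.487262],
})

# how="left" keeps every neighbourhood; validate raises a MergeError if a
# postal code appears more than once in the coordinates table.
merged = pd.merge(neighbourhoods, coordinates, on="Postcode",
                  how="left", validate="many_to_one", indicator=True)

# Rows marked "left_only" had no matching postal code in the coordinates table.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched["Postcode"].tolist())
```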


In [16]:
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="specify_your_app_name_here")
location = geolocator.geocode("The Annex, Toronto, Ontario")
print(location.address)
print(location.latitude, location.longitude)
#print(location.raw)

The Annex, University—Rosedale, Old Toronto, Toronto, Golden Horseshoe, Ontario, M5T 2E9, Canada
43.6703377 -79.407117
