# Getting Dataset of Toronto

This notebook is for week 3 of the Capstone project.
In this assignment, we collect a dataset of Toronto containing Postal Code, Borough, Neighbourhood, Latitude, and Longitude. First, we get the postal codes and boroughs from the Wikipedia list of Toronto postal codes ( https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M ). We copy the list manually and save it in a csv file (Toronto.csv). After that, we get the latitude and longitude of each neighbourhood.

Before we download the data and start exploring it, let's install and import the dependencies needed by this notebook.

In [0]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed or conda is not available
import folium # map rendering library

print('Libraries imported.')

/bin/bash: conda: command not found
Libraries imported.


## 1. Download and Explore Dataset

First, we need a dataset file that contains the Toronto neighbourhoods (Toronto_Neighbourhood_Coordinates.csv) with the following fields: Postal Code (Postcode), Borough, Neighbourhood, latitude, longitude.
If we run this notebook for the first time, that file does not exist yet, and this notebook creates it with these steps:
*   Get the list of postal codes in Toronto from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M . We copy the list manually, save it in a csv file (Toronto.csv), and upload it as an asset so this notebook can access it. This file contains the following attributes: Postal Code (Postcode), Borough, Neighbourhood.
*   Load the Toronto neighbourhood data from Toronto.csv into a dataframe (df_Toronto).
*   Loop over each neighbourhood in df_Toronto and get its latitude and longitude using geopy.geocoders. When the loop finishes, df_Toronto contains the following attributes: Postal Code (Postcode), Borough, Neighbourhood, latitude, longitude.
*   Save the dataframe to a csv file (Toronto_Neighbourhood_Coordinates.csv).

After this notebook has been run once, we have a ready dataset that contains all the attributes needed for the analysis. On later runs, the notebook only needs to read Toronto_Neighbourhood_Coordinates.csv to load the data into a dataframe.
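The "build once, then reuse" flow described above could be sketched as follows. This is only an illustration, not the approach the notebook takes (the notebook uses manually set flags): `load_or_build` and its `build` argument are hypothetical names, and the sketch assumes the csv lives on the local file system rather than on github or in an IBM Watson asset.

```python
import os

import pandas as pd


def load_or_build(cache_path, build):
    """Return the cached dataframe if cache_path exists.

    Otherwise call build() to create the dataframe, save it to
    cache_path, and return it. Both names are hypothetical helpers.
    """
    if os.path.exists(cache_path):
        # Later runs: just read the ready dataset
        return pd.read_csv(cache_path)

    # First run: build the dataset and cache it for the next run
    df = build()
    df.to_csv(cache_path, index=False)
    return df
```

With a helper like this, the first call pays the cost of geocoding every neighbourhood and later calls only read the csv.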

### 1.1. Prepare dataset  

In [0]:
isTorontoNeighbourghoodCoordFileExists = True #refers to Toronto_Neighbourhood_Coordinates.csv. If it exists, change manually to 'True'. If it doesn't exist, change manually to 'False'

#if isTorontoNeighbourghoodCoordFileExists = False, set isRunInIBMWatson or isRunInGoogleColab
isRunInIBMWatson = False #if this notebook runs in IBM Watson, change manually to 'True', otherwise 'False'. In IBM Watson, an additional script is needed to read/write the csv as an asset
isRunInGoogleColab = True #if this notebook runs in Google Colab, change manually to 'True', otherwise 'False'. In Google Colab, the csv is saved to Google Drive

#if isTorontoNeighbourghoodCoordFileExists = True, get the csv from Maria Triagni's github and set isReadCsvFromGithub to True
isReadCsvFromGithub = True #if Toronto_Neighbourhood_Coordinates.csv exists in Maria Triagni's github, change manually to 'True'. If it doesn't exist, change manually to 'False'

torontoFile = "Toronto.csv"
torontoNeighbourghoodCoordFile = "Toronto_Neighbourhood_Coordinates.csv"
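
The environment flags above are set by hand. As a rough alternative, the runtime could be guessed by checking which platform-specific modules can be found. This is a sketch under assumptions: `module_available` and `detect_environment` are hypothetical helpers, and treating `google.colab` and `project_lib` as markers for Google Colab and IBM Watson is an assumption based on the imports this notebook uses for those platforms.

```python
import importlib.util


def module_available(name):
    """Return True if the module can be located without importing it fully."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised e.g. when the parent package of a dotted name is missing
        return False


def detect_environment(is_available=module_available):
    """Guess where the notebook runs: 'colab', 'ibm_watson', or 'local'."""
    if is_available("google.colab"):
        return "colab"
    if is_available("project_lib"):
        return "ibm_watson"
    return "local"
```

The result could then drive the flags, for example `isRunInGoogleColab = detect_environment() == "colab"`.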


#### 1.1.1. If we run using IBM Watson and Toronto.csv is stored as an asset

In [0]:
if ( (isTorontoNeighbourghoodCoordFileExists != True ) and (isRunInIBMWatson == True)) :
  import types
  import pandas as pd
  from botocore.client import Config
  import ibm_boto3

  def __iter__(self): return 0

  # @hidden_cell
  # The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
  # You might want to remove those credentials before you share the notebook.
  client_c8e4850bd8dc439d8772288cbd371cd8 = ibm_boto3.client(service_name='s3',
      ibm_api_key_id='wmyv12NJK8swbJlynpLN0_WSnpZSQ6nrEmJPY5dbEOkh',
      ibm_auth_endpoint="https://iam.ng.bluemix.net/oidc/token",
      config=Config(signature_version='oauth'),
      endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

  bucket_id = 'phytonbasicfords-donotdelete-pr-ksk41itkqmstbb'

  body = client_c8e4850bd8dc439d8772288cbd371cd8.get_object(Bucket=bucket_id,Key=torontoFile)['Body']
  # add missing __iter__ method, so pandas accepts body as file-like object
  if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )
      
  df_Toronto = pd.read_csv(body)
  df_Toronto.head()

print('Finish')

Finish


#### 1.1.2. If we run using Google Colab and Toronto.csv is on github

In [0]:
if ( (isTorontoNeighbourghoodCoordFileExists != True ) and (isReadCsvFromGithub == True)) :
  import pandas as pd 

  url = "https://raw.githubusercontent.com/MariaTriagni/Data/master/" + torontoFile
  print(url)
  df_Toronto = pd.read_csv(url) 

df_Toronto.head() 

https://raw.githubusercontent.com/MariaTriagni/Data/master/Toronto.csv


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M5R,Central Toronto,Yorkville
1,M3J,North York,York University
2,M2P,North York,York Mills West
3,M2L,North York,York Mills
4,M4C,East York,Woodbine Heights


#### 1.1.3. Create latitude and longitude for each neighborhood

In [0]:
if (isTorontoNeighbourghoodCoordFileExists != True ) :
  #add latitude and longitude column
  df_Toronto['latitude']=0.0
  df_Toronto['longitude']=0.0

df_Toronto.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,latitude,longitude
0,M5R,Central Toronto,Yorkville,0.0,0.0
1,M3J,North York,York University,0.0,0.0
2,M2P,North York,York Mills West,0.0,0.0
3,M2L,North York,York Mills,0.0,0.0
4,M4C,East York,Woodbine Heights,0.0,0.0


In [0]:
#loop for each neighborhood to get latitude and longitude
if (isTorontoNeighbourghoodCoordFileExists != True ) :

  from geopy.geocoders import Nominatim

  geolocator = Nominatim(user_agent="coursera_exam")

  for index, row in df_Toronto.iterrows():
    #if index > 284:
        location = geolocator.geocode("{}, Toronto, Ontario".format(row['Neighbourhood']))
        
        if location is not None: #geocode returns None when the address is not found
            df_Toronto.at[index,'latitude'] = location.latitude
            df_Toronto.at[index,'longitude'] = location.longitude 
        print ("{},{},{},{},{},{}".format( index, row['Postcode'] , row['Borough'] , row['Neighbourhood'], df_Toronto.at[index,'latitude'], df_Toronto.at[index,'longitude']  ))
        
df_Toronto.head()

285,M1S,Scarborough,Agincourt,43.7853531,-79.2785494
286,M5H,Downtown Toronto,Adelaide,43.65082325,-79.37793584643234


Unnamed: 0,Postcode,Borough,Neighbourhood,latitude,longitude
0,M5R,Central Toronto,Yorkville,43.671386,-79.390168
1,M3J,North York,York University,43.779242,-79.483559
2,M2P,North York,York Mills West,43.744039,-79.406657
3,M2L,North York,York Mills,43.744039,-79.406657
4,M4C,East York,Woodbine Heights,43.69993,-79.319132
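
The geocoding loop above can also be written as a reusable function. The following is a minimal sketch, not the notebook's own code: `add_coordinates` and its parameters are hypothetical, and the `geocode` argument is any callable that takes a query string and returns an object with `latitude` and `longitude` attributes, or `None` when nothing is found (for example `geolocator.geocode` from the cell above). It differs from the notebook in two ways: neighbourhoods that are not found keep `NaN` instead of `0.0`, and it pauses between requests, since the public Nominatim service asks for at most one request per second.

```python
import time

import pandas as pd


def add_coordinates(df, geocode, delay_seconds=1.0):
    """Return a copy of df with 'latitude' and 'longitude' columns filled in.

    geocode: callable taking a query string and returning an object with
    .latitude and .longitude, or None if the place is not found.
    """
    result = df.copy()
    result["latitude"] = float("nan")
    result["longitude"] = float("nan")

    for index, name in result["Neighbourhood"].items():
        location = geocode("{}, Toronto, Ontario".format(name))
        if location is not None:
            result.at[index, "latitude"] = location.latitude
            result.at[index, "longitude"] = location.longitude
        # Pause between requests so the geocoding service is not overloaded
        time.sleep(delay_seconds)

    return result
```

Leaving missing results as `NaN` makes them easy to find later with `result[result["latitude"].isna()]`, whereas `0.0` is a valid coordinate and could be plotted by mistake.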


#### 1.1.4. Save the dataset to file Toronto_Neighbourhood_Coordinates.csv

In [0]:

#This one doesn't work, so I will comment this
#df_postalcode.to_csv(outputFile, encoding="utf-8")
#with open(outputFile, 'rb') as data:
#     client_c8e4850bd8dc439d8772288cbd371cd8.upload_fileobj(data,  bucket_id, outputFile)
        
#print('finish')

finish


In [0]:
# @hidden_cell
#save dataframe to csv file in asset if running in IBM Watson
if ( (isTorontoNeighbourghoodCoordFileExists != True ) and (isRunInIBMWatson == True)) :
  from project_lib import Project
  project = Project(project_id='539dfb05-329b-43a1-899d-d02c237b0ca3', project_access_token='p-d344efbe810a87640107d144b70e99fddc1e30f4')
  pc = project.project_context
  project.save_data(data=df_Toronto.to_csv(index=False),file_name=torontoNeighbourghoodCoordFile,overwrite=True) #df_Toronto is the dataframe built above

print('finish')

finish


In [0]:
#save dataframe to csv file in google drive if running in Google Colab 
if ( (isTorontoNeighbourghoodCoordFileExists != True ) and (isRunInGoogleColab == True)) :
  from google.colab import drive
  drive.mount('drive')
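  df_Toronto.to_csv(torontoNeighbourghoodCoordFile, index=False) #write the dataframe built above, without an extra index column
  !cp "{torontoNeighbourghoodCoordFile}" "drive/My Drive/Colab Notebooks"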

Download the file Toronto_Neighbourhood_Coordinates.csv manually and save it to github under the account 'MariaTriagni'. The next time we need the dataset, we just read it from github at: https://github.com/MariaTriagni/Data/blob/master/Toronto_Neighbourhood_Coordinates.csv

#### 1.1.5. Read dataset from file Toronto_Neighbourhood_Coordinates.csv from github 

Toronto_Neighbourhood_Coordinates.csv is now available in the github account 'MariaTriagni' ( https://github.com/MariaTriagni/Data/blob/master/Toronto_Neighbourhood_Coordinates.csv ). All we need to do now is load it into a dataframe.
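
Note that the link above is a github page ("blob") URL, while the code in this notebook reads the csv from the matching raw.githubusercontent.com URL. The mapping between the two can be sketched as below. `github_blob_to_raw` is a hypothetical helper, and it assumes the branch name contains no slash.

```python
from urllib.parse import urlparse


def github_blob_to_raw(url):
    """Convert a github.com 'blob' page URL into its raw file URL."""
    parts = urlparse(url)
    segments = parts.path.strip("/").split("/")

    # Expected path layout: <user>/<repo>/blob/<branch>/<path to file>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError("not a github blob URL: {}".format(url))

    user, repo, _, branch = segments[:4]
    file_path = segments[4:]
    return "https://raw.githubusercontent.com/" + "/".join([user, repo, branch] + file_path)


raw_url = github_blob_to_raw(
    "https://github.com/MariaTriagni/Data/blob/master/Toronto_Neighbourhood_Coordinates.csv"
)
print(raw_url)
```

The raw URL returns the csv content itself, which is what `pd.read_csv` needs, while the blob URL returns an HTML page.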

In [0]:
#If Toronto_Neighbourhood_Coordinates.csv already exists and is available in github, load the csv from github into dataframe df_Toronto
if ( (isTorontoNeighbourghoodCoordFileExists == True) and (isReadCsvFromGithub == True)) :
  import pandas as pd 

  url = "https://raw.githubusercontent.com/MariaTriagni/Data/master/" + torontoNeighbourghoodCoordFile
  print(url)
  df_Toronto = pd.read_csv(url) 

df_Toronto.head() 


https://raw.githubusercontent.com/MariaTriagni/Data/master/Toronto_Neighbourhood_Coordinates.csv


Unnamed: 0,Postcode,Borough,Neighbourhood,latitude,longitude
0,M5R,Central Toronto,Yorkville,43.671386,-79.390168
1,M3J,North York,York University,43.779242,-79.483559
2,M2P,North York,York Mills West,43.744039,-79.406657
3,M2L,North York,York Mills,43.744039,-79.406657
4,M4C,East York,Woodbine Heights,43.69993,-79.319132


In [0]:
df_Toronto

Unnamed: 0,Postcode,Borough,Neighbourhood,latitude,longitude
0,M5R,Central Toronto,Yorkville,43.671386,-79.390168
1,M3J,North York,York University,43.779242,-79.483559
2,M2P,North York,York Mills West,43.744039,-79.406657
3,M2L,North York,York Mills,43.744039,-79.406657
4,M4C,East York,Woodbine Heights,43.699930,-79.319132
...,...,...,...,...,...
282,M8W,Etobicoke,Alderwood,43.601717,-79.545232
283,M9V,Etobicoke,Albion Gardens,43.741665,-79.584543
284,M1V,Scarborough,Agincourt North,43.808038,-79.266439
285,M1S,Scarborough,Agincourt,43.785353,-79.278549
