## Introduction

This notebook demonstrates how to explore, segment, and cluster the neighborhoods of Toronto.

##### Install and import the required packages and libraries

In [1]:
import numpy as np # handle data in a vectorized manner

import pandas as pd # data analysis
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)

import json

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # get latitude and longitude from an address

import requests
from pandas.io.json import json_normalize # flatten JSON data into a pandas dataframe

#import matplotlib modules
import matplotlib.cm as cm
import matplotlib.colors as colors

#import k-means for clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes
import folium

print('Libraries imported!')

Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda

  added / updated specs: 
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geopy-1.18.1               |             py_0          51 KB  conda-forge
    geographiclib-1.49         |             py_0          32 KB  conda-forge
    ------------------------------------------------------------
                                           Total:          84 KB

The following NEW packages will be INSTALLED:

    geographiclib: 1.49-py_0     conda-forge

The following packages will be UPDATED:

    geopy:         1.11.0-py36_0 conda-forge --> 1.18.1-py_0 conda-forge


Downloading and Extracting Packages
geopy-1.18.1         | 51 KB     | ##################################### | 100% 
geographiclib-1.49   | 32 KB     | ##################################### | 100% 
Preparing transaction: done

##### Read the data from Wikipedia into a dataframe with ***three columns: PostalCode, Borough, and Neighborhood***

In [2]:
# read every HTML table on the Wikipedia page into a list of pandas dataframes
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
dfs_toronto = pd.read_html(url,header=0)

# the first dataframe in the list holds the postal code table
df_toronto=dfs_toronto[0]
print(df_toronto.head(1))
print()

# set the column names
df_toronto.columns=['PostalCode','Borough','Neighborhood']

#check data and row count
print(df_toronto.head(15))
print()
print(df_toronto.shape[0])


  Postcode       Borough Neighbourhood
0      M1A  Not assigned  Not assigned

   PostalCode           Borough      Neighborhood
0         M1A      Not assigned      Not assigned
1         M2A      Not assigned      Not assigned
2         M3A        North York         Parkwoods
3         M4A        North York  Victoria Village
4         M5A  Downtown Toronto      Harbourfront
5         M5A  Downtown Toronto       Regent Park
6         M6A        North York  Lawrence Heights
7         M6A        North York    Lawrence Manor
8         M7A      Queen's Park      Not assigned
9         M8A      Not assigned      Not assigned
10        M9A         Etobicoke  Islington Avenue
11        M1B       Scarborough             Rouge
12        M1B       Scarborough           Malvern
13        M2B      Not assigned      Not assigned
14        M3B        North York   Don Mills North

289


##### Only process the cells that have an assigned borough. Ignore cells with a borough that is ***Not assigned***

In [3]:
# borough values that mark a row as unassigned
missing_borough=['Not assigned']

# keep only rows with an assigned borough; this builds a new dataframe and resets its index
df_toronto = df_toronto[~df_toronto['Borough'].isin(missing_borough)].reset_index(drop=True)

#check result and row count
print(df_toronto.head(10))
print()
print(df_toronto.shape[0])
print()
print(df_toronto.tail(10))

  PostalCode           Borough      Neighborhood
0        M3A        North York         Parkwoods
1        M4A        North York  Victoria Village
2        M5A  Downtown Toronto      Harbourfront
3        M5A  Downtown Toronto       Regent Park
4        M6A        North York  Lawrence Heights
5        M6A        North York    Lawrence Manor
6        M7A      Queen's Park      Not assigned
7        M9A         Etobicoke  Islington Avenue
8        M1B       Scarborough             Rouge
9        M1B       Scarborough           Malvern

212

    PostalCode    Borough              Neighborhood
202        M8Y  Etobicoke                 Mimico NE
203        M8Y  Etobicoke            Old Mill South
204        M8Y  Etobicoke        The Queensway East
205        M8Y  Etobicoke     Royal York South East
206        M8Y  Etobicoke                  Sunnylea
207        M8Z  Etobicoke  Kingsway Park South West
208        M8Z  Etobicoke                 Mimico NW
209        M8Z  Etobicoke        The Qu
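
The filtering step above can be tried on a small toy table. This is a minimal sketch, assuming three made-up rows that mirror the layout of the Wikipedia table; the names `toy` and `assigned` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy rows (assumed for illustration) with the same three columns as df_toronto
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# isin() builds a boolean mask; ~ negates it, so only assigned boroughs are kept
assigned = toy[~toy['Borough'].isin(['Not assigned'])].reset_index(drop=True)

# Sanity check: no unassigned borough should remain after filtering
assert not assigned['Borough'].eq('Not assigned').any()
print(assigned)
```

Note that a row such as M7A survives the filter even though its neighborhood is still `Not assigned`; only the borough column is tested here.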

##### Combine into ***one row all neighborhoods, separated by commas, that belong to the same postal code area***

In [4]:
# group by postal code and borough (keeping first-seen order) and join the neighborhood names
df_toronto_new = (
    df_toronto.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
    .apply(lambda x: ', '.join(x))
    .reset_index(name='Neighborhood')
)
print(df_toronto_new.shape)
print(df_toronto_new.head(10))
print(df_toronto_new.tail(10))

(103, 3)
  PostalCode           Borough                      Neighborhood
0        M3A        North York                         Parkwoods
1        M4A        North York                  Victoria Village
2        M5A  Downtown Toronto         Harbourfront, Regent Park
3        M6A        North York  Lawrence Heights, Lawrence Manor
4        M7A      Queen's Park                      Not assigned
5        M9A         Etobicoke                  Islington Avenue
6        M1B       Scarborough                    Rouge, Malvern
7        M3B        North York                   Don Mills North
8        M4B         East York   Woodbine Gardens, Parkview Hill
9        M5B  Downtown Toronto          Ryerson, Garden District
    PostalCode           Borough  \
93         M8W         Etobicoke   
94         M9W         Etobicoke   
95         M1X       Scarborough   
96         M4X  Downtown Toronto   
97         M5X  Downtown Toronto   
98         M8X         Etobicoke   
99         M4Y  Downtown
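
To see the grouping step in isolation, here is a minimal sketch on toy data. The rows and the names `toy` and `merged` are assumptions for illustration. It uses `.agg(', '.join)`, which should behave like the `apply` with a lambda used in the cell above.

```python
import pandas as pd

# Toy rows (assumed for illustration): M5A appears twice, M3A once
toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# sort=False keeps groups in order of first appearance instead of sorting the keys
merged = (
    toy.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
    .agg(', '.join)   # join each group's neighborhood names into one string
    .reset_index()    # turn the group keys back into ordinary columns
)
print(merged)
```

Grouping on both `PostalCode` and `Borough` keeps the borough column in the result without needing a separate merge.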

##### Apply the following rule: if a cell has a borough but a ***Not assigned neighborhood***, then ***the neighborhood will be the same as the borough***

In [5]:
df_toronto_new['Neighborhood'] = np.where(df_toronto_new['Neighborhood'] == 'Not assigned', df_toronto_new['Borough'], df_toronto_new['Neighborhood'])
print(df_toronto_new.shape)
print(df_toronto_new.head(10))


(103, 3)
  PostalCode           Borough                      Neighborhood
0        M3A        North York                         Parkwoods
1        M4A        North York                  Victoria Village
2        M5A  Downtown Toronto         Harbourfront, Regent Park
3        M6A        North York  Lawrence Heights, Lawrence Manor
4        M7A      Queen's Park                      Queen's Park
5        M9A         Etobicoke                  Islington Avenue
6        M1B       Scarborough                    Rouge, Malvern
7        M3B        North York                   Don Mills North
8        M4B         East York   Woodbine Gardens, Parkview Hill
9        M5B  Downtown Toronto          Ryerson, Garden District
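
The fallback rule above can also be written with pandas alone. The following is a minimal sketch on toy data, assuming two made-up rows; `toy`, `with_np`, and `with_mask` are hypothetical names. It compares the `np.where` form used in the notebook with `Series.mask`, which replaces values wherever the condition is true.

```python
import numpy as np
import pandas as pd

# Toy rows (assumed for illustration): the first has no neighborhood assigned
toy = pd.DataFrame({
    'Borough': ["Queen's Park", 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods'],
})

unassigned = toy['Neighborhood'].eq('Not assigned')

# Same approach as the notebook: take Borough where the condition holds
with_np = pd.Series(
    np.where(unassigned, toy['Borough'], toy['Neighborhood']),
    index=toy.index,
)

# pandas-only alternative: mask() swaps in Borough where the condition holds
with_mask = toy['Neighborhood'].mask(unassigned, toy['Borough'])

# Both forms should agree on every row
assert with_np.tolist() == with_mask.tolist()

toy['Neighborhood'] = with_mask
print(toy)
```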


##### Display the ***shape (number of rows and columns) of the dataframe***

In [6]:
print("Number of rows and columns in the dataframe= ", df_toronto_new.shape)

Number of rows and columns in the dataframe=  (103, 3)
