Week 3 Final Assignment Part 1 - Segmenting and Clustering Neighbourhoods in Toronto, ON.

Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe as per example in the project instructions.

Detailed steps below


Step 1: THREE COLUMNS ARE CREATED FOR THIS DATAFRAME: PostalCode, Borough and Neighborhood (the scraped Wikipedia table labels them 'Postal Code', 'Borough' and 'Neighbourhood')

In [1]:
import numpy as np # library to handle data in a vectorized manner

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

!pip install folium
import folium # map rendering library

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

print('Libraries imported.')

Libraries imported.


In [2]:
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M', header=0)

# pd.read_html is used instead of Beautiful Soup because it parses every
# HTML table on the page straight into a DataFrame
required_cols = ['Postal Code', 'Borough','Neighbourhood']

# Keep the first table whose column names match required_cols exactly.
# The break belongs inside the if, so the loop stops only once a match is found.
for table in tables:
    if list(table.columns) == required_cols:
        pstl_data_df = table
        break
print("Shape of Dataframe is - ", pstl_data_df.shape)
pstl_data_df.head()

Shape of Dataframe is -  (180, 3)


,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
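
The table-selection step can be tried without network access. The following is a minimal sketch, assuming two hand-built DataFrames stand in for the list that pd.read_html returns; the helper name find_table and the toy rows are illustrative, not part of the original notebook.

```python
import pandas as pd

required_cols = ["Postal Code", "Borough", "Neighbourhood"]

# Hypothetical stand-ins for the tables scraped from the page
tables = [
    pd.DataFrame({"Province": ["Ontario"], "Code": ["M"]}),
    pd.DataFrame({
        "Postal Code": ["M1A", "M3A"],
        "Borough": ["Not assigned", "North York"],
        "Neighbourhood": ["Not assigned", "Parkwoods"],
    }),
]

def find_table(candidates, wanted_cols):
    """Return the first table whose column names equal wanted_cols, else None."""
    for candidate in candidates:
        # list equality compares both names and order, and never raises
        # when the two column lists differ in length
        if list(candidate.columns) == wanted_cols:
            return candidate
    return None

match = find_table(tables, required_cols)
no_match = find_table(tables[:1], required_cols)
print(match.shape)
```

Returning None when nothing matches makes a changed page layout easy to detect before any later cell fails.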


Step 2: Process only the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [3]:
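# Keep only rows whose Borough is assigned; .copy() gives an independent
# dataframe so that adding columns later does not raise SettingWithCopyWarning
pstl_data_df = pstl_data_df[pstl_data_df.Borough!="Not assigned"].copy()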
print("Shape of Dataframe is - ",pstl_data_df.shape)
pstl_data_df.head()

Shape of Dataframe is -  (103, 3)


,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
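
The filtered rows keep their original index labels (2, 3, 4, ...), which is why the output above does not start at 0. A small sketch of the same boolean-mask filter on toy data, with an optional reset of the index; the three rows are copied from the table above and the variable names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Boolean Series: True where the borough is assigned
mask = toy["Borough"] != "Not assigned"

# .copy() makes the result independent of the original dataframe
assigned = toy[mask].copy()

# Optional: renumber the rows from 0 and discard the old labels
renumbered = assigned.reset_index(drop=True)

print(assigned.index.tolist())
print(renumbered.index.tolist())
```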


Step 3: If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. 

In [4]:
pstl_data_df['new_nghbr'] = np.where(pstl_data_df['Neighbourhood']=='Not assigned',pstl_data_df['Borough'],pstl_data_df['Neighbourhood'])
pstl_data_df.head(11)

,Postal Code,Borough,Neighbourhood,new_nghbr
2,M3A,North York,Parkwoods,Parkwoods
3,M4A,North York,Victoria Village,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront","Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights","Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government","Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village","Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge","Malvern, Rouge"
11,M3B,North York,Don Mills,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens","Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson","Garden District, Ryerson"
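
None of the rows shown above actually triggers the fallback, so the effect of np.where is hard to see. Here is a minimal sketch on made-up rows where one neighbourhood is unassigned, alongside the equivalent pandas Series.mask call; the toy values are assumptions chosen only to exercise the rule.

```python
import numpy as np
import pandas as pd

# Toy rows: the first has a borough but no neighbourhood
toy = pd.DataFrame({
    "Borough": ["Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

unassigned = toy["Neighbourhood"] == "Not assigned"

# np.where(condition, value_if_true, value_if_false), applied element-wise
via_where = np.where(unassigned, toy["Borough"], toy["Neighbourhood"])

# Same rule in pandas: replace values where the mask is True
via_mask = toy["Neighbourhood"].mask(unassigned, toy["Borough"])

print(list(via_where))
print(via_mask.tolist())
```

Both forms give the same values; Series.mask returns a Series that keeps the index, while np.where returns a plain NumPy array.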


Step 4: More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.

The "new_nghbr" column is used to build the grouped column: wherever rows share the same Postal Code and the same Borough, their neighbourhood values are combined into one string using groupby and apply.

In [5]:
# Join with ', ' so combined values use the same separator as the scraped data;
# reset_index already returns a DataFrame, so no pd.DataFrame wrapper is needed
can_postal_cd_df = pstl_data_df.groupby(['Postal Code','Borough'])['new_nghbr'].apply(', '.join).reset_index()

# Renaming the combined column to match the image given:
can_postal_cd_df = can_postal_cd_df.rename(columns = {"new_nghbr":"Neighborhood"})

In [6]:
print(can_postal_cd_df.head())
print("\n The Final Shape of the dataframe is  - ",can_postal_cd_df.shape)

  Postal Code      Borough                            Neighborhood
0         M1B  Scarborough                          Malvern, Rouge
1         M1C  Scarborough  Rouge Hill, Port Union, Highland Creek
2         M1E  Scarborough       Guildwood, Morningside, West Hill
3         M1G  Scarborough                                  Woburn
4         M1H  Scarborough                               Cedarbrae

 The Final Shape of the dataframe is  -  (103, 3)


As seen above, the shape of the new dataframe is (103, 3).
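
The row count is the same before and after grouping (103), which suggests the scraped table already lists one row per postal code. To see the combining step do real work, here is a minimal sketch on toy rows where M5A appears twice, as described in the assignment text; the data is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M1B"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Scarborough"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Malvern"],
})

# Group on the two key columns, then join each group's names into one string
combined = (
    toy.groupby(["Postal Code", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

print(combined)
```

groupby sorts the group keys by default, so M1B comes before M5A in the result, matching the ordering seen in the output above.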

## PART 1 ENDS HERE

## PART 2 ADDING THE LAT LONG DATA TO THE EXISTING DATAFRAME

Step 1: Installing Geocoder

In [7]:
!pip install geocoder

Collecting geocoder
  Downloading geocoder-1.38.1-py2.py3-none-any.whl (98 kB)
Collecting ratelim
  Downloading ratelim-0.1.6-py2.py3-none-any.whl (4.0 kB)
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


In [8]:
import geocoder # import geocoder
print("Geo Coder imported!")

Geo Coder imported!



Creating a function to get the Lat Long data from the Postal Code

In [9]:
def get_geocoder(postal_code_from_df):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code_from_df.strip()))
        lat_lng_coords = g.latlng
    # index only after the loop: g.latlng is None when a lookup fails,
    # and indexing None inside the loop would raise a TypeError
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    return latitude,longitude
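
A loop that retries until coordinates arrive never ends if the service keeps failing for some postal code. One possible variant, sketched here under assumptions, caps the number of attempts and returns (None, None) on failure. The names geocode_with_retry and flaky_lookup are hypothetical, and the stub below replaces the real geocoder.arcgis call so the example runs offline.

```python
import time

def geocode_with_retry(lookup, query, max_attempts=3, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords[0], coords[1]
        # brief pause before the next attempt
        time.sleep(delay_seconds)
    return None, None

# Hypothetical stub: returns None twice, then the M1B coordinates
calls = {"count": 0}

def flaky_lookup(query):
    calls["count"] += 1
    if calls["count"] < 3:
        return None
    return [43.81139, -79.19662]

result = geocode_with_retry(flaky_lookup, "M1B, Toronto, Ontario")
always_none = geocode_with_retry(lambda q: None, "M1B, Toronto, Ontario", max_attempts=2)

print(result)
print(always_none)
```

With the real service, lookup could be a small wrapper such as lambda q: geocoder.arcgis(q).latlng, which mirrors the call used in get_geocoder.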

Adding the Latitude and Longitude columns to the Pandas DataFrame

In [10]:
can_postal_cd_df['Latitude'], can_postal_cd_df['Longitude'] = zip(*can_postal_cd_df['Postal Code'].apply(get_geocoder))
can_postal_cd_df.head()

,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.81139,-79.19662
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.78574,-79.15875
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.76575,-79.1747
3,M1G,Scarborough,Woburn,43.76812,-79.21761
4,M1H,Scarborough,Cedarbrae,43.76944,-79.23892


In [11]:
print("Shape of the dataframe is - ",can_postal_cd_df.shape)

Shape of the dataframe is -  (103, 5)
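
The zip(*...) idiom used to fill the two columns can look opaque. A minimal sketch of what it does, assuming a small dictionary of known coordinates stands in for the geocoding function; fake_coords is a made-up name, and its values are the first two rows of the output above.

```python
import pandas as pd

toy = pd.DataFrame({"Postal Code": ["M1B", "M1C"]})

# Stand-in for get_geocoder: maps a postal code to a (lat, lng) tuple
fake_coords = {"M1B": (43.81139, -79.19662), "M1C": (43.78574, -79.15875)}

# apply returns one (lat, lng) tuple per row
pairs = toy["Postal Code"].apply(lambda code: fake_coords[code])

# zip(*pairs) transposes the row-wise tuples into one tuple of latitudes
# and one tuple of longitudes, which are assigned to the two new columns
toy["Latitude"], toy["Longitude"] = zip(*pairs)

print(toy)
```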


## Map of Toronto

In [12]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_ontario")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto, Ontario are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto, Ontario are 43.6534817, -79.3839347.


Using Folium to Map

In [13]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# The dataframe column is named 'Postal Code' (with a space); using 'PostalCode' raises a KeyError
for lat, long, post, borough, neigh in zip(can_postal_cd_df['Latitude'], can_postal_cd_df['Longitude'], can_postal_cd_df['Postal Code'], can_postal_cd_df['Borough'], can_postal_cd_df['Neighborhood']):
    label = "{} ({}): {}".format(borough, post, neigh)
    popup = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, long],
        radius=5,
        popup=popup,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)
    
map_toronto