# Segmenting and Clustering Neighborhoods in Toronto

This notebook contains the operations to obtain and manipulate geographical data for Toronto neighbourhoods. It is the week three assignment of the Coursera Data Science Capstone project.

In [1]:
import pandas as pd
import numpy as np

import requests
import geocoder
#from bs4 import BeautifulSoup
#import html5lib

#from IPython.display import Image 
#from IPython.core.display import HTML 
#from pandas.io.json import json_normalize

#import folium 

## Part 1: Import data from Wikipedia

Per the instructions in the assignment, the neighbourhood names and postal codes can be scraped from [Wikipedia](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).


In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

The table that contains the data is not named, so it can't easily be found by string matching. From visual inspection of the wiki page, however, only a few tables appear to be present. It would therefore not be too costly to read all of them directly into dataframes.

In [3]:
dataframe_list = pd.read_html(url, flavor='bs4')
len(dataframe_list)

3

The initial assessment that only a few tables are present on the Wikipedia page is correct. By trial and error (which is feasible since only three frames have to be viewed), '0' is found to be the correct index for the table.

In [4]:
toronto_nbhs = dataframe_list[0]
toronto_nbhs


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."
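
As an alternative to picking the index by trial and error, the table could be selected by the column names it is expected to contain. The following is a minimal sketch, not part of the original notebook: `find_table` is a hypothetical helper, and the two toy frames only stand in for the list that `pd.read_html` returns.

```python
import pandas as pd

def find_table(frames, required_columns):
    """Return the first dataframe whose columns include all required names."""
    required = set(required_columns)
    for frame in frames:
        if required.issubset(frame.columns):
            return frame
    raise ValueError('no table with columns {}'.format(sorted(required)))

# Toy stand-ins for the frames returned by pd.read_html (made-up data)
frames = [
    pd.DataFrame({'A': [1]}),
    pd.DataFrame({'Postal Code': ['M3A'],
                  'Borough': ['North York'],
                  'Neighbourhood': ['Parkwoods']}),
]

wanted = ['Postal Code', 'Borough', 'Neighbourhood']
table = find_table(frames, wanted)
```

In the notebook this would be called as `find_table(dataframe_list, wanted)`, assuming the column headers on the page stay as shown above.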


### Clean the data

First drop all the rows with unassigned boroughs. This can be done by selecting only the rows in which the borough field is not labelled 'Not assigned'.

In [5]:
# Create a dataframe without unassigned boroughs
toronto_nbhs = toronto_nbhs[toronto_nbhs['Borough'] != 'Not assigned']
toronto_nbhs.reset_index(inplace=True, drop=True)

Then make sure that all the neighbourhoods that share the same postal code are merged.

In [6]:
# Check how many postal codes have been assigned to more than one neighbourhood
toronto_nbhs['Postal Code'].describe(include='all')

count     103
unique    103
top       M6M
freq        1
Name: Postal Code, dtype: object

There are as many unique postal codes (103) as there are entries (103). Apparently, every postal code is already uniquely assigned to a neighbourhood _entry_.

It may be the case that a neighbourhood entry already combines multiple neighbourhoods with the same postal code. This can be visually verified.

In [7]:
toronto_nbhs.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
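
The table already lists neighbourhoods that share a postal code in a single comma-separated entry, so no merging is needed here. Had the data come with one row per neighbourhood, the merge could be sketched as below. The toy frame is made up for illustration and is not the scraped data.

```python
import pandas as pd

# Toy data with one row per neighbourhood (illustrative only)
toy = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# Combine the neighbourhood names of rows that share a postal code
merged = (
    toy.groupby(['Postal Code', 'Borough'], sort=False)['Neighbourhood']
       .agg(', '.join)
       .reset_index()
)
```

Grouping on both 'Postal Code' and 'Borough' keeps the borough column in the result; this assumes each postal code belongs to exactly one borough.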


Lastly, fix neighbourhood names that are marked 'Not assigned' by assigning them the name of their borough.

In [8]:
# Check how many neighbourhood names need fixing.
(toronto_nbhs['Neighbourhood']=='Not assigned').sum()

0

Apparently, none of the neighbourhood name entries need fixing.
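
If some entries had needed fixing, the replacement described above could be done with a boolean mask. This is a small sketch on made-up rows, not on the notebook's dataframe.

```python
import pandas as pd

# Toy rows: the first one has a borough but no neighbourhood name (illustrative only)
toy = pd.DataFrame({
    'Borough': ['Etobicoke', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods'],
})

# Copy the borough name into every neighbourhood marked 'Not assigned'
unnamed = toy['Neighbourhood'] == 'Not assigned'
toy.loc[unnamed, 'Neighbourhood'] = toy.loc[unnamed, 'Borough']
```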

The dataframe can be 'described' for a quick sanity check.

In [9]:
toronto_nbhs.describe(include='all')

Unnamed: 0,Postal Code,Borough,Neighbourhood
count,103,103,103
unique,103,11,99
top,M6M,North York,Downsview
freq,1,24,4


Note that apparently four distinct postal codes are associated with the neighbourhood Downsview.
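
To see which postal codes sit behind a repeated neighbourhood name, the names can be counted and the matching rows selected. The sketch below uses made-up rows; the postal codes in it are placeholders, not taken from the scraped table.

```python
import pandas as pd

# Toy rows in which one neighbourhood name occurs twice (illustrative only)
toy = pd.DataFrame({
    'Postal Code': ['M3K', 'M3L', 'M3A'],
    'Neighbourhood': ['Downsview', 'Downsview', 'Parkwoods'],
})

# Neighbourhood names that occur in more than one row
counts = toy['Neighbourhood'].value_counts()
repeated = counts[counts > 1].index.tolist()

# Postal codes that belong to those repeated names
codes = toy.loc[toy['Neighbourhood'].isin(repeated), 'Postal Code'].tolist()
```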

This concludes the cleaning of the dataframe, as per the assignment's instructions.

In [10]:
toronto_nbhs.shape

(103, 3)

In [11]:
# Save the cleaned dataframe as a '.csv'
path = '~/Documents/Projects/Coursera-Capstone/Neighbourhoods.csv'

toronto_nbhs.to_csv(path)

## Part 2: Add location data

### Location data using Python geocoder

First try to find location data using geocoder.

In [12]:
# Test how many instances cannot be geocoded on the first attempt

i = 0 # initialise the counter

for postal_code in toronto_nbhs['Postal Code']:
    g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
    lat_lng_coords = g.latlng
    if lat_lng_coords is None:
        i = i + 1
        #print('{} could not be geocoded'.format(postal_code))
    else:
        lat = lat_lng_coords[0]
        lon = lat_lng_coords[1]
        #print(postal_code, lat, lon)
print('{} instances could not be geocoded'.format(i))

103 instances could not be geocoded


None of the 103 postal codes could be geocoded on the first pass. The loop described in the assignment is time-consuming, and applying it to all 103 postal codes is not feasible.
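
One way to keep such a loop from running indefinitely is to cap the number of attempts per postal code. The sketch below is an assumption about how that could look, not the assignment's code: `geocode_with_retries` and `fake_lookup` are hypothetical names, and the lookup is passed in as a function so that the example runs without a geocoding service.

```python
import time

def geocode_with_retries(query, lookup, max_attempts=3, pause=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(pause)  # wait before trying again
    return None  # give up after max_attempts failures

# Stand-in for a geocoding service that only answers on the third call
calls = {'n': 0}

def fake_lookup(query):
    calls['n'] += 1
    return [43.65, -79.38] if calls['n'] >= 3 else None

result = geocode_with_retries('M5A, Toronto, Ontario', fake_lookup)
```

With the geocoder package, the lookup could be `lambda q: geocoder.google(q).latlng`, mirroring the call in the cell above. A cap only bounds the waiting time; it does not help if the service rejects every request.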

As per the instructions, geographical coordinates will now be extracted from a [csv file](https://cocl.us/Geospatial_data).

### Location data using a csv file

In [13]:
path = '~/Documents/Projects/Coursera-Capstone/geodata/Geospatial_Coordinates.csv'
gsd = pd.read_csv(path)

gsd.sort_values(by=['Postal Code'], inplace=True) # Put the list in alphabetical order of the postal codes
gsd.reset_index(inplace=True, drop=True)
gsd.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [14]:
# Put the neighbourhood data in the same order as the geospatial coordinates.
# Reassigning the sorted result (rather than sorting in place) avoids pandas'
# warning about setting values on a copy of a slice.
toronto_nbhs = toronto_nbhs.sort_values(by=['Postal Code']).reset_index(drop=True)
toronto_nbhs.head()


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [15]:
toronto_nbhs = pd.concat([toronto_nbhs, gsd[['Latitude', 'Longitude']]], axis=1)

toronto_nbhs.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
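
Concatenating along `axis=1` pairs rows by position, so it is only correct because both frames were sorted by postal code and contain the same codes. A key-based merge does not depend on row order. The following is a minimal sketch on two small frames (values copied from the rows above, deliberately in different orders), not a replacement for the cell above.

```python
import pandas as pd

# Two small frames listing the same postal codes in different orders
nbhs = pd.DataFrame({
    'Postal Code': ['M1C', 'M1B'],
    'Borough': ['Scarborough', 'Scarborough'],
})
coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# Match rows on the postal code itself; validate raises if a code is duplicated
joined = nbhs.merge(coords, on='Postal Code', how='left', validate='one_to_one')
```

With `how='left'`, a postal code missing from the coordinates file would show up as `NaN` in 'Latitude' and 'Longitude' rather than silently shifting the other rows.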


In [16]:
toronto_nbhs.describe(include='all')

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
count,103,103,103,103.0,103.0
unique,103,11,99,,
top,M6M,North York,Downsview,,
freq,1,24,4,,
mean,,,,43.704608,-79.397153
std,,,,0.052463,0.097146
min,,,,43.602414,-79.615819
25%,,,,43.660567,-79.464763
50%,,,,43.696948,-79.38879
75%,,,,43.74532,-79.340923


This concludes the second part of the assignment.

In [17]:
# Save the dataframe, now including coordinates, as a '.csv' (overwrites the file saved in Part 1)
path = '~/Documents/Projects/Coursera-Capstone/Neighbourhoods.csv'

toronto_nbhs.to_csv(path)