# Segmenting and Clustering Neighborhoods in Toronto (Part 1)

This notebook is part of the Applied Data Science Capstone course on Coursera.

As a first step, we download geckodriver, the driver Selenium needs to control Firefox.

In [5]:
import wget
wget.download('https://github.com/mozilla/geckodriver/releases/download/v0.27.0/geckodriver-v0.27.0-win64.zip', 'gecko.zip')

'gecko.zip'

In [8]:
from zipfile import ZipFile
# Unpack the driver executable into the current directory
with ZipFile('gecko.zip', 'r') as z:
    z.extractall('.')

We now scrape https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M for postal code data. The table has a unique class of "wikitable sortable", which we can use to identify the element.

In [4]:
from selenium import webdriver

driver = webdriver.Firefox(executable_path='geckodriver.exe')
driver.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
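
The class attribute "wikitable sortable" holds two space-separated class names, so matching on `wikitable` alone is enough to pick out the table. As a rough illustration of that matching, here is a minimal sketch that uses only the standard library's `html.parser` instead of Selenium. The `WikitableParser` class and the `sample_html` snippet are made up for this example and are not part of the original notebook.

```python
from html.parser import HTMLParser


class WikitableParser(HTMLParser):
    """Collect <td> text from tables whose class list includes 'wikitable'."""

    def __init__(self):
        super().__init__()
        self.in_table = False
        self.in_cell = False
        self.rows = []
        self._row = []
        self._cell = []

    def handle_starttag(self, tag, attrs):
        # The class attribute may list several names separated by spaces
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'table' and 'wikitable' in classes:
            self.in_table = True
        elif self.in_table and tag == 'tr':
            self._row = []
        elif self.in_table and tag == 'td':
            self.in_cell = True
            self._cell = []

    def handle_endtag(self, tag):
        if tag == 'table':
            self.in_table = False
        elif self.in_table and tag == 'td':
            self.in_cell = False
            self._row.append(''.join(self._cell).strip())
        elif self.in_table and tag == 'tr' and self._row:
            # Header rows contain only <th> cells, so they stay empty and are skipped
            self.rows.append(self._row)

    def handle_data(self, data):
        if self.in_cell:
            self._cell.append(data)


# Hypothetical excerpt standing in for the real page
sample_html = """
<table class="infobox"><tr><td>ignored</td></tr></table>
<table class="wikitable sortable">
<tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

parser = WikitableParser()
parser.feed(sample_html)
rows = parser.rows
print(rows)
```

Only cells from the table carrying the `wikitable` class end up in `rows`; the first table is ignored.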

The assignment mentioned duplicate postal codes. That issue is no longer present in the source as of the day this notebook was created (31 Oct 2020), which means we can simply loop over the table rows and their td cells without any extra checks.

The mismatch between Borough and Neighborhood, where only one of the two would be "Not assigned", is also no longer present.

In [5]:
table = driver.find_element_by_class_name('wikitable')
rows = table.find_elements_by_tag_name('tr')

data = []
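# Skip the header row, which holds th cells rather than td cells
for row in rows[1:]:
    # Each data row is expected to have exactly three td cells
    code, borough, neighbour = row.find_elements_by_tag_name('td')
    code = code.text
    borough = borough.text
    neighbour = neighbour.text
    # Drop postal codes that have no borough or neighbourhood assigned
    if 'Not assigned' in borough or 'Not assigned' in neighbour:
        continue
    data.append([code, borough, neighbour])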
driver.quit()
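
The two observations about the source table (no duplicate postal codes, no rows where only one of Borough and Neighborhood is "Not assigned") could be verified on the raw rows before filtering. The following is a minimal sketch of such a check, assuming the rows are available as (code, borough, neighbourhood) tuples. The `check_rows` helper and the `sample_rows` values are made up to exercise both checks and do not reflect the real table.

```python
from collections import Counter


def check_rows(raw_rows):
    """Return (duplicate postal codes, rows where exactly one field is 'Not assigned')."""
    counts = Counter(code for code, _, _ in raw_rows)
    duplicates = sorted(code for code, n in counts.items() if n > 1)
    half_assigned = [
        row for row in raw_rows
        if (row[1] == 'Not assigned') != (row[2] == 'Not assigned')
    ]
    return duplicates, half_assigned


# Made-up rows: one duplicated code and one half-assigned row
sample_rows = [
    ('M1A', 'Not assigned', 'Not assigned'),
    ('M3A', 'North York', 'Parkwoods'),
    ('M5A', 'Downtown Toronto', 'Regent Park'),
    ('M5A', 'Downtown Toronto', 'Harbourfront'),
    ('M0X', 'Etobicoke', 'Not assigned'),
]

duplicates, half_assigned = check_rows(sample_rows)
print(duplicates)
print(half_assigned)
```

If both lists come back empty for the real table, looping without extra checks is safe.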

In [6]:
import pandas as pd
df = pd.DataFrame(data, columns=['PostalCode','Borough','Neighborhood'])
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


The next step is to add latitude and longitude to the data frame. I gave up on geocoder because it returned None every single time, so I downloaded the provided CSV file instead.

In [1]:
import wget
wget.download('http://cocl.us/Geospatial_data', 'coords.csv')

'coords.csv'

Let's take a look at the downloaded data.

In [7]:
coords = pd.read_csv('coords.csv')
coords.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Merge this with the data frame from the previous step.

In [8]:
df2 = pd.merge(df, coords, left_on=['PostalCode'], right_on=['Postal Code'], how='left')
df2.drop('Postal Code', axis=1, inplace=True)
df2.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
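
Because the merge is a left join, a postal code that has no match in the coordinates file would stay in the result with NaN for Latitude and Longitude. A minimal sketch of how to spot such rows with the `indicator` parameter of `pd.merge` follows. The small data frames are stand-ins, and the code `M9Z` is made up to force a miss.

```python
import pandas as pd

# Stand-in for the scraped table; 'M9Z' is a made-up code with no coordinates
df = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Nowhere'],
    'Neighborhood': ['Parkwoods', 'Victoria Village', 'Nowhere'],
})

# Stand-in for coords.csv
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# indicator=True adds a '_merge' column that says where each row came from
merged = pd.merge(df, coords, left_on='PostalCode', right_on='Postal Code',
                  how='left', indicator=True)

# Rows marked 'left_only' found no coordinates
unmatched = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that every postal code received coordinates.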


Save the result to CSV so we can reuse this data in the next part without downloading everything again.

In [10]:
df2.to_csv('Canada_Postal.csv')
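
One thing to keep in mind when reloading the file: `to_csv` writes the row index by default, and `read_csv` will then show it as a column named `Unnamed: 0`. The sketch below illustrates the difference with an in-memory buffer and a one-row stand-in data frame. Passing `index=False` is an optional variation, not what the cell above does.

```python
import io

import pandas as pd

# One-row stand-in for df2
df2 = pd.DataFrame({
    'PostalCode': ['M3A'],
    'Borough': ['North York'],
    'Neighborhood': ['Parkwoods'],
    'Latitude': [43.753259],
    'Longitude': [-79.329656],
})

# Default behaviour: the index is written as an unnamed first column
buf = io.StringIO()
df2.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)

# With index=False only the data columns are written
buf = io.StringIO()
df2.to_csv(buf, index=False)
buf.seek(0)
without_index = pd.read_csv(buf)

print(list(with_index.columns))
print(list(without_index.columns))
```

Alternatively, a file written with the default settings can be read back with `pd.read_csv(path, index_col=0)` to restore the index instead of getting an extra column.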

End of part 1. Part 2 will be all about segmentation.