# Segmenting and Clustering Neighborhoods in Toronto

This notebook contains the operations to obtain and manipulate geographical data for Toronto neighbourhoods. It is the week three assignment of the Coursera Data Science Capstone project.

In [1]:
import pandas as pd
import numpy as np

import requests
from bs4 import BeautifulSoup
import html5lib

from IPython.display import Image, HTML  # IPython.core.display is deprecated for these
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated

import folium 

## Import data from Wikipedia

Per the instructions in the assignment, the neighbourhood names and postal codes can be scraped from [Wikipedia](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).


In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

The table that contains the data is not named, so it cannot easily be found by string matching. Visual inspection of the wiki page, however, suggests that it holds only a few tables. Reading all of them directly into dataframes is therefore not too costly.

In [3]:
dataframe_list = pd.read_html(url, flavor='bs4')
len(dataframe_list)

3

The initial assessment was correct: the Wikipedia page contains only three tables. Trial and error (feasible since only three frames have to be viewed) shows that the table of interest is at index 0.
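
As an alternative to trial and error, the table could be picked by its column headers. The following is a minimal sketch rather than part of the original notebook: the helper name `find_table` and the toy frames are hypothetical, and it assumes the target is the only table with `Postal Code`, `Borough` and `Neighbourhood` columns.

```python
import pandas as pd


def find_table(frames, required_columns):
    """Return (index, frame) of the first table holding all required columns."""
    required = set(required_columns)
    for i, frame in enumerate(frames):
        # Column labels may be integers for header-less tables, so compare as strings.
        if required.issubset(map(str, frame.columns)):
            return i, frame
    raise ValueError(f"no table contains all of {sorted(required)}")


# Toy stand-ins for the list returned by pd.read_html (hypothetical data).
frames = [
    pd.DataFrame({"Postal Code": ["M1A", "M3A"],
                  "Borough": ["Not assigned", "North York"],
                  "Neighbourhood": ["Not assigned", "Parkwoods"]}),
    pd.DataFrame({"Region": ["A", "B"]}),
    pd.DataFrame({0: ["x"], 1: ["y"]}),
]

index, table = find_table(frames, ["Postal Code", "Borough", "Neighbourhood"])
print(index)
```

Selecting by header keeps the notebook working if the page later gains or reorders tables, as long as the column names stay the same.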

In [4]:
toronto_nbhs = dataframe_list[0]
toronto_nbhs


| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| ... | ... | ... | ... |
| 175 | M5Z | Not assigned | Not assigned |
| 176 | M6Z | Not assigned | Not assigned |
| 177 | M7Z | Not assigned | Not assigned |
| 178 | M8Z | Etobicoke | Mimico NW, The Queensway West, South of Bloor,... |


## Clean the data

First drop all the rows with unassigned boroughs. This can be done by selecting only the rows in which the borough field is not labelled 'Not assigned'.

In [5]:
# Create a dataframe without unassigned boroughs and renumber the rows from 0.
# Chaining reset_index avoids modifying a filtered slice in place.
toronto_nbhs = toronto_nbhs[toronto_nbhs['Borough'] != 'Not assigned'].reset_index(drop=True)

Then make sure that all the neighbourhoods that share the same postal code are merged.

In [6]:
# Check how many postal codes have been assigned to more than one neighbourhood
toronto_nbhs['Postal Code'].describe(include='all')

count     103
unique    103
top       M9C
freq        1
Name: Postal Code, dtype: object

There are as many unique postal codes (103) as there are entries (103), so every postal code is already assigned to exactly one neighbourhood _entry_.
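
Had the check turned up repeated postal codes, the rows could have been merged with a group-by. The sketch below uses hypothetical toy data in which one postal code appears twice. It is not needed for the scraped table and only illustrates one way the merge might be done.

```python
import pandas as pd

# Toy data (hypothetical) in which postal code M5A appears on two rows.
df = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Count rows whose postal code already appeared on an earlier row.
n_duplicates = df["Postal Code"].duplicated().sum()

# Collapse to one row per postal code, joining neighbourhood names with commas.
merged = (
    df.groupby(["Postal Code", "Borough"], sort=False, as_index=False)
      .agg({"Neighbourhood": ", ".join})
)

print(n_duplicates)
print(merged)
```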

A single neighbourhood entry may, however, already combine multiple neighbourhoods that share a postal code. This can be verified visually.

In [7]:
toronto_nbhs.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


Lastly, fix neighbourhood names that are marked 'Not assigned' by assigning them the name of their borough.

In [8]:
# Check how many neighbourhood names need fixing.
(toronto_nbhs['Neighbourhood']=='Not assigned').sum()

0

Apparently, none of the neighbourhood name entries need fixing.
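
If some names had needed fixing, a boolean mask could copy the borough into the neighbourhood column. This is a minimal sketch on hypothetical toy data, not a step the scraped table requires.

```python
import pandas as pd

# Toy data (hypothetical) with one neighbourhood marked 'Not assigned'.
df = pd.DataFrame({
    "Postal Code": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

# Select the rows that lack a neighbourhood name.
unnamed = df["Neighbourhood"] == "Not assigned"

# Overwrite only those rows with the name of their borough.
df.loc[unnamed, "Neighbourhood"] = df.loc[unnamed, "Borough"]

print(df)
```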

This concludes the cleaning of the dataframe, as per the assignment's instructions.

In [9]:
toronto_nbhs.shape

(103, 3)

In [10]:
# Save the cleaned dataframe as a '.csv'
path = '~/Documents/Projects/Coursera-Capstone/Neighbourhoods.csv'

toronto_nbhs.to_csv(path)
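
Note that `to_csv` writes the row index by default, so reading the file back yields an extra `Unnamed: 0` column. The sketch below shows the difference on hypothetical toy data written to a temporary directory. Whether to pass `index=False` depends on how later notebooks read the file, which is an assumption here.

```python
import os
import tempfile

import pandas as pd

# Toy data (hypothetical) standing in for the cleaned dataframe.
df = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Victoria Village"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "Neighbourhoods.csv")

    # Default: the index is written as an unnamed first column.
    df.to_csv(csv_path)
    with_index = pd.read_csv(csv_path)

    # index=False: only the data columns are written.
    df.to_csv(csv_path, index=False)
    without_index = pd.read_csv(csv_path)

print(list(with_index.columns))
print(list(without_index.columns))
```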