# Capstone project -- week 3 -- download neighborhoods table

This is the first of three notebooks for the week 3 assignment of the Coursera Capstone project course. The first task is to download a table of postal codes for Toronto neighborhoods from Wikipedia; the link was provided in the submission instructions of the assignment. The resulting dataframe is stored locally so that the other two parts of the assignment can use it.

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

WIKI_PAGE_TORONTO = r"https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
OUTPUT_FILE = r"../data/postal_codes_toronto.csv"

In [2]:
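# download the Wikipedia page and stop early on an HTTP error
source = requests.get(WIKI_PAGE_TORONTO)
source.raise_for_status()
soup = BeautifulSoup(source.content, 'lxml')
# collect all tables on the page; the first parsed table holds the postal codes
table = soup.find_all('table')
df_toronto = pd.read_html(str(table))[0]
df_toronto.head()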


   Postal Code           Borough               Neighborhood
0          M1A      Not assigned               Not assigned
1          M2A      Not assigned               Not assigned
2          M3A        North York                  Parkwoods
3          M4A        North York           Victoria Village
4          M5A  Downtown Toronto  Regent Park, Harbourfront


While the above table contains all entries listed on the Wikipedia page, the instructions further require that:

* rows where "Borough" is "Not assigned" are removed from the data frame,
* each postal code is listed only once, with its neighborhoods separated by commas, and
* if a borough has been assigned but the neighborhood is "Not assigned", the name of the borough is used as the neighborhood name.
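
For illustration only, the three rules could be applied in one step as sketched below. This is a minimal sketch, not the approach used in this notebook: the helper name `clean_postal_codes` and the sample rows are hypothetical and are not taken from the Wikipedia table.

```python
import pandas as pd

# Hypothetical sample rows (made up for this sketch, not real Toronto data)
raw = pd.DataFrame({
    "Postal Code": ["X1A", "X2A", "X3A", "X3A", "X4A"],
    "Borough": ["Not assigned", "Borough A", "Borough B", "Borough B", "Borough C"],
    "Neighborhood": ["Not assigned", "Hood 1", "Hood 2", "Hood 3", "Not assigned"],
})


def clean_postal_codes(df):
    """Apply the three cleaning rules to a postal code table (hypothetical helper)."""
    # Drop rows whose borough is "Not assigned"
    out = df[df["Borough"] != "Not assigned"].copy()

    # Use the borough name where the neighborhood is "Not assigned"
    missing = out["Neighborhood"] == "Not assigned"
    out.loc[missing, "Neighborhood"] = out.loc[missing, "Borough"]

    # Keep one row per postal code, joining its neighborhoods with commas
    out = (
        out.groupby(["Postal Code", "Borough"], as_index=False, sort=False)["Neighborhood"]
        .agg(lambda names: ", ".join(names))
    )
    return out


cleaned = clean_postal_codes(raw)
print(cleaned)
```

In this sample, the two "X3A" rows collapse into one row with both neighborhoods, and "X4A" takes its borough name as the neighborhood name.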


In [3]:
# remove all rows from the dataframe where Borough is "Not assigned"
df_toronto.drop(df_toronto[df_toronto['Borough'] == 'Not assigned'].index, axis=0, inplace=True)

Now that all rows with a "Not assigned" borough have been removed, the other two requirements can be addressed. Manual inspection of the data suggests that neither case still occurs in the data set. However, as things can be missed, two tests are run to confirm this observation.

In [4]:
# check whether there are neighborhoods "Not assigned"
print(df_toronto[df_toronto['Neighborhood'] == 'Not assigned'])

Empty DataFrame
Columns: [Postal Code, Borough, Neighborhood]
Index: []


Querying the data frame for rows where "Neighborhood" is "Not assigned" returns an empty data frame, i.e. no remaining row has a "Not assigned" neighborhood that would need to be replaced by the name of its borough.

In [5]:
# check for duplicate postal codes
print(len(df_toronto['Postal Code']))       # number of rows, i.e. all postal code entries
print(len(set(df_toronto['Postal Code'])))  # number of unique postal codes

103
103


The first print statement gives the number of rows in the data frame df_toronto. If several rows shared the same postal code, the number of unique values in this column would be smaller than the number of rows. As both print statements return the same result, no postal code is duplicated. The postal code 'M5A', which the instructions reference, shows that the desired format of comma-separated neighborhoods is already present in the data. Therefore, no further action is needed: the shape can be printed and the data saved for subsequent use.
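
As a hedged aside, the checks described above could also be bundled into one small helper that counts offending rows. The function name `find_problems` and the sample frame are assumptions made for this sketch; the frame deliberately contains one violation of each kind and is not real data.

```python
import pandas as pd


def find_problems(df):
    """Count rows that violate the cleaning requirements (hypothetical helper)."""
    return {
        # rows that should have been dropped
        "unassigned_boroughs": int((df["Borough"] == "Not assigned").sum()),
        # rows whose neighborhood should be replaced by the borough name
        "unassigned_neighborhoods": int((df["Neighborhood"] == "Not assigned").sum()),
        # repeated occurrences of a postal code (the first occurrence is not counted)
        "duplicated_postal_codes": int(df["Postal Code"].duplicated().sum()),
    }


# Hypothetical sample with one problem of each kind
sample = pd.DataFrame({
    "Postal Code": ["X1A", "X2A", "X2A", "X3A"],
    "Borough": ["Borough A", "Borough B", "Borough B", "Not assigned"],
    "Neighborhood": ["Hood 1", "Hood 2", "Hood 3", "Not assigned"],
})

problems = find_problems(sample)
print(problems)
```

A data frame that meets all requirements would yield zero for every count, which matches what the two manual tests in this notebook found for df_toronto.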

In [6]:
print(df_toronto.shape)

(103, 3)


<p style="line-height:1.4"><b><font size=12>There is neighborhood information for 103 Toronto postal codes.</font></b></p>

In [7]:
df_toronto.to_csv(OUTPUT_FILE, index=False)