**Our first step is to scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain the data in its table of postal codes and transform that data into a pandas dataframe like the one shown below:**

Let's begin by importing our libraries. I chose BeautifulSoup to scrape the Wikipedia table because it is flexible and I expect to use it for more complex scraping in the future.

In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup


I downloaded the page's HTML into the wiki_url variable, parsed it with BeautifulSoup, and reviewed the parsed HTML by printing it with prettify. For presentation purposes, I've commented out the print call because of the large amount of output, but I reviewed it to properly understand the table I wanted to extract.

In [57]:
# Note: despite its name, wiki_url holds the page's HTML text, not the URL
wiki_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(wiki_url, "lxml")
#print(soup.prettify())

In [58]:
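from io import StringIO

# Grab the first <table> on the page, which is assumed to be the postal code table
table = soup.find('table')
# Wrap the HTML in StringIO: newer pandas versions deprecate passing a literal HTML string
df = pd.read_html(StringIO(str(table)))[0]
df.head()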
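
As a minimal sketch of what the table extraction does, the snippet below parses a tiny hand-written HTML table and builds a dataframe from it row by row, so it runs without a network connection. The sample HTML is a made-up stand-in for the Wikipedia page, the names `sample_html`, `sample_soup`, `sample_table` and `sample_df` are assumptions for illustration, and it uses Python's built-in `html.parser` rather than `lxml`.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical, hand-written stand-in for the Wikipedia table (not real page content)
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table")

# The first row holds the headers; the remaining rows hold the data cells
rows = sample_table.find_all("tr")
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
records = [[td.get_text(strip=True) for td in row.find_all("td")] for row in rows[1:]]

sample_df = pd.DataFrame(records, columns=headers)
print(sample_df)
```

Building the rows by hand like this is more verbose than pd.read_html, but it makes each parsing step visible and is easy to adapt when a table needs custom handling.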


Clean up our dataframe by removing all rows whose Borough is Not assigned. We will also merge rows that share a postal code so that all of their neighbourhoods appear in the same row.

In [63]:
# Get the row indexes where the Borough column is 'Not assigned'
notassigned = df[df['Borough'] == 'Not assigned'].index

# Delete these rows from the dataframe and renumber the index from 0
df.drop(notassigned, inplace=True)
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
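
The same clean-up can also be written as a boolean filter. The sketch below shows that variant on a small made-up dataframe; `toy_df` and `toy_clean` are hypothetical names, and the rows are illustrative rather than taken from the scraped table.

```python
import pandas as pd

# Made-up sample rows for illustration only
toy_df = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows whose Borough is assigned, then renumber the index from 0
toy_clean = toy_df[toy_df["Borough"] != "Not assigned"].reset_index(drop=True)
print(toy_clean)
```

This version returns a new dataframe instead of modifying the original in place, which can make it easier to re-run a notebook cell without side effects.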


Next, I grouped the rows that share a postal code and joined their neighbourhoods into a single comma-separated string. To be sure the duplicate postal codes were merged as expected, I checked the M5A code.

In [67]:
# Group rows by postal code and borough, joining their neighbourhoods with ', '
df = df.groupby(['Postcode','Borough'])['Neighbourhood'].apply(', '.join).reset_index()
# Double check that the common postal codes were merged properly
df.loc[df['Postcode'] == "M5A"]

Unnamed: 0,Postcode,Borough,Neighbourhood
53,M5A,Downtown Toronto,"Harbourfront, Regent Park"
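
To illustrate how the groupby step collapses duplicate postal codes, here is a small sketch on a toy dataframe. The M5A rows mirror the result shown above; the third row and the names `toy_rows` and `toy_merged` are assumptions added for illustration.

```python
import pandas as pd

# Two rows share the postal code M5A; the extra row is an illustrative assumption
toy_rows = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# One output row per (Postcode, Borough) pair, with neighbourhoods joined by ', '
toy_merged = (
    toy_rows.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .apply(", ".join)
    .reset_index()
)
print(toy_merged)
```

Note that groupby sorts the result by the group keys by default, so the merged dataframe comes back ordered by postal code.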


**In step 2, I replicated the analysis that we did on the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.**

Before adding in the coordinate values for each postal code, let's check the shape of our dataframe.

In [68]:
df.shape

(103, 3)

Given that the Geocoder Python package can be very unreliable, and in the interest of time and learning, I chose to use the following link to a CSV file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

In [70]:
csv_file = 'https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv'
geo_df = pd.read_csv(csv_file)
geo_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


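The next step was to combine this data with our neighbourhood dataframe. I matched rows on the postal code itself rather than on row position, so the result stays correct even if the two tables are ordered differently.

In [74]:
# Match each postal code in df with its coordinates in geo_df
combined_df = df.merge(geo_df, left_on='Postcode', right_on='Postal Code', how='left')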
combined_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Postal Code,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",M1B,43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",M1C,43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",M1E,43.763573,-79.188711
3,M1G,Scarborough,Woburn,M1G,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,M1H,43.773136,-79.239476
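
Combining two tables by row position only gives correct results when both are in the same order. The sketch below contrasts an index-based join with a key-based merge on a toy example; the coordinates are the M1B and M1C values shown above, deliberately listed in reverse order, and the names `left`, `right`, `by_index` and `by_key` are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
})

# Same coordinates as in the output above, but deliberately in reverse order
right = pd.DataFrame({
    "Postal Code": ["M1C", "M1B"],
    "Latitude": [43.784535, 43.806686],
    "Longitude": [-79.160497, -79.194353],
})

# join() pairs rows by index label, regardless of what the postal codes say
by_index = left.join(right)

# merge() pairs rows whose postal codes actually match
by_key = left.merge(right, left_on="Postcode", right_on="Postal Code", how="left")

print((by_index["Postcode"] == by_index["Postal Code"]).all())  # False
print((by_key["Postcode"] == by_key["Postal Code"]).all())      # True
```

Comparing the two postal code columns in this way is also a quick sanity check to run on the real combined dataframe.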
