# Canada's Postcodes M

In this notebook I explore the Wikipedia data on Canadian neighbourhoods whose postal codes start with M.

This is a notebook for the [IBM Data Science Cert](https://www.coursera.org/learn/applied-data-science-capstone/home/welcome).

We know we'll need to scrape the web and load a table into a pandas DataFrame. BeautifulSoup handles the HTML parsing, and pandas takes over from there.

In [22]:
import urllib.request
from bs4 import BeautifulSoup as bs
import pandas as pd

## Get Data
First retrieve the HTML response from the Wikipedia page, then pass it to BeautifulSoup for parsing. In the parsed document we search for the table with the class 'wikitable sortable', which is how Wikipedia marks this data table.

In [23]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
page = urllib.request.urlopen(url)

In [24]:
soup = bs(page, "lxml")

In [25]:
borough_table=soup.find('table', class_='wikitable sortable')
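
One thing worth knowing about the lookup above: passing a string with two class names to `class_` matches the class attribute as an exact string, so the order of the classes matters. Here is a minimal offline sketch of that behaviour. The HTML snippet is made up for illustration, and I use the built-in `html.parser` so the example does not depend on lxml. A CSS selector via `select_one` is one order-independent alternative, not necessarily what the page needs today.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia page: two tables, one of them a wikitable
# whose classes appear in the opposite order.
html = """
<table class="infobox"><tr><td>ignore me</td></tr></table>
<table class="sortable wikitable">
  <tr><th>Postal Code</th><th>Borough</th></tr>
  <tr><td>M3A</td><td>North York</td></tr>
</table>
"""

demo_soup = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute: the class order differs, so nothing is found.
by_string = demo_soup.find("table", class_="wikitable sortable")

# CSS selector: matches a table carrying both classes, in any order.
by_selector = demo_soup.select_one("table.wikitable.sortable")

print(by_string is None)
print(by_selector.find("td").text)
```

If `find` ever returns `None` on the live page, the later `borough_table.find_all('tr')` call fails with an `AttributeError`, so checking for `None` right after the lookup is a cheap safeguard.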

## Process Scraped HTML
This is where the processing happens. Go through all rows `<tr>` and extract all cells `<td>`. Each row becomes a list of cell strings, and these are collected in `l`, the list of lists. The header row contains only `<th>` cells, so it supplies the column names instead. With the headers and the rows in hand it's easy to create a DataFrame. I decided against dropping "Not assigned" entries during parsing and use pandas to clean them below. Note that in this data every row with a "Not assigned" neighbourhood also has a "Not assigned" borough.

In [26]:
l = []
non = "Not assigned"
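for tr in borough_table.find_all('tr'):
    td = tr.find_all('td')
    if not td:
        # header row: it has only <th> cells, which become the column names
        headers = [th.text.strip() for th in tr.find_all('th')]
        continue
    # data row: one stripped string per cell
    row = [cell.text.strip() for cell in td]
    l.append(row)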
canada_m = pd.DataFrame(l, columns=headers)
canada_m

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


In [27]:
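# drop rows without a borough; .copy() so the next cell assigns to its own frame, not a slice
canada_m = canada_m[canada_m.Borough != non].copy()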
canada_m.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [28]:
canada_m.loc[canada_m.Neighborhood == non, 'Neighborhood'] = canada_m.loc[canada_m.Neighborhood == non, 'Borough']
canada_m.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [29]:
canada_m.shape

(103, 3)
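
To see what the two cleaning steps do, here is a small sketch on a toy frame. The rows are hypothetical: in particular the `M9Z` row, which has a borough but no neighbourhood, is invented to exercise the fallback, since that case does not occur in the scraped data.

```python
import pandas as pd

NOT_ASSIGNED = "Not assigned"

# Hypothetical toy rows; the last one has a borough but no neighbourhood.
toy = pd.DataFrame(
    {
        "Postal Code": ["M1A", "M3A", "M9Z"],
        "Borough": [NOT_ASSIGNED, "North York", "Etobicoke"],
        "Neighborhood": [NOT_ASSIGNED, "Parkwoods", NOT_ASSIGNED],
    }
)

# Step 1: drop rows without a borough; .copy() gives an independent frame to assign into.
cleaned = toy[toy["Borough"] != NOT_ASSIGNED].copy()

# Step 2: where the neighbourhood is missing, fall back to the borough name.
missing = cleaned["Neighborhood"] == NOT_ASSIGNED
cleaned.loc[missing, "Neighborhood"] = cleaned.loc[missing, "Borough"]

print(cleaned)
```

On the real table the second step is only a safeguard, which is why `head()` looks the same before and after it.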

## Geospatial Data
Now obtain geospatial data to add latitude and longitude to the table. We can join the two tables on the Postal Code column.

In [30]:
geospatial = pd.read_csv('http://cocl.us/Geospatial_data')
geospatial.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [31]:
canada_m = pd.merge(canada_m, geospatial, on="Postal Code")
canada_m.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
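
`pd.merge` defaults to an inner join, so a postal code missing from the geospatial file would silently disappear. A hedged sketch of how one could check for that, using toy frames: the `M0X` row is invented, while the two coordinate pairs are copied from the output above. The `how`, `validate` and `indicator` arguments are standard `pd.merge` parameters.

```python
import pandas as pd

# Toy stand-ins for canada_m and geospatial; "M0X" is a made-up code with no coordinates.
left = pd.DataFrame(
    {
        "Postal Code": ["M3A", "M4A", "M0X"],
        "Borough": ["North York", "North York", "Nowhere"],
    }
)
right = pd.DataFrame(
    {
        "Postal Code": ["M3A", "M4A"],
        "Latitude": [43.753259, 43.725882],
        "Longitude": [-79.329656, -79.315572],
    }
)

# how="left" keeps every row of the left frame, validate raises on duplicate keys,
# and indicator adds a "_merge" column saying where each row came from.
checked = pd.merge(
    left, right, on="Postal Code", how="left", validate="one_to_one", indicator=True
)

unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

In this notebook the borough counts further down still add up to 103, the row count before the merge, so the inner join did not drop anything.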


## Exploratory Data Analysis

Have a little play with the data. Let's start with counting the postal codes per borough.

In [32]:
# Series.value_counts replaces the top-level pd.value_counts, which newer pandas deprecates
canada_m.Borough.value_counts()

North York          24
Downtown Toronto    19
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
East York            5
York                 5
East Toronto         5
Mississauga          1
Name: Borough, dtype: int64
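
These counts also say how many markers to expect on the maps. A quick sanity check, with the numbers typed in by hand from the output above (so this is a sketch on copied values, not a recomputation from the data):

```python
import pandas as pd

# Borough counts copied from the value_counts output above.
counts = pd.Series(
    {
        "North York": 24,
        "Downtown Toronto": 19,
        "Scarborough": 17,
        "Etobicoke": 12,
        "Central Toronto": 9,
        "West Toronto": 6,
        "East York": 5,
        "York": 5,
        "East Toronto": 5,
        "Mississauga": 1,
    }
)

total = counts.sum()

# Boroughs with "Toronto" in their name, the same test the second map uses.
toronto = counts[counts.index.str.contains("Toronto")]

print(total, toronto.sum())
```

The total matches the 103 rows of the cleaned table, and the boroughs with "Toronto" in their name account for 39 of them.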

## Maps

Time to create some maps. First one with all data points in the M postal codes, then one with only the boroughs that have Toronto in their name.

In [33]:
import folium

In [34]:
# create map of Toronto centred on the mean latitude and longitude
map_toronto = folium.Map(location=[canada_m.Latitude.mean(), canada_m.Longitude.mean()], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(canada_m['Latitude'], canada_m['Longitude'], canada_m['Borough'], canada_m['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='purple',
        fill=True,
        fill_color='purple',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [35]:
# create map of the Toronto boroughs centred on the mean latitude and longitude
map_dt_toronto = folium.Map(location=[canada_m.Latitude.mean(), canada_m.Longitude.mean()], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(canada_m['Latitude'], canada_m['Longitude'], canada_m['Borough'], canada_m['Neighborhood']):
    if "Toronto" not in borough:
        continue
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='purple',
        fill=True,
        fill_color='purple',
        fill_opacity=0.7,
        parse_html=False).add_to(map_dt_toronto)  
    
map_dt_toronto
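
The two map cells differ only in the borough filter and the zoom level, so they could share one helper. Below is a minimal sketch of that idea. The names `marker_rows` and `build_map` are my own, not part of folium, and the toy rows are copied from the merged table above. The folium import sits inside `build_map` so the row selection can be tried without folium installed.

```python
import pandas as pd


def marker_rows(df, borough_contains=None):
    """Yield (lat, lng, label) per row, optionally keeping only boroughs containing a substring."""
    columns = zip(df["Latitude"], df["Longitude"], df["Borough"], df["Neighborhood"])
    for lat, lng, borough, neighborhood in columns:
        if borough_contains is not None and borough_contains not in borough:
            continue
        yield lat, lng, "{}, {}".format(neighborhood, borough)


def build_map(df, borough_contains=None, zoom_start=11):
    """Build a folium map centred on the mean coordinates with one circle marker per kept row."""
    import folium  # imported here so marker_rows works without folium

    fmap = folium.Map(
        location=[df["Latitude"].mean(), df["Longitude"].mean()],
        zoom_start=zoom_start,
    )
    for lat, lng, label in marker_rows(df, borough_contains):
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=folium.Popup(label, parse_html=True),
            color="purple",
            fill=True,
            fill_color="purple",
            fill_opacity=0.7,
        ).add_to(fmap)
    return fmap


# Two rows copied from the merged table above.
toy = pd.DataFrame(
    {
        "Borough": ["North York", "Downtown Toronto"],
        "Neighborhood": ["Parkwoods", "Regent Park, Harbourfront"],
        "Latitude": [43.753259, 43.65426],
        "Longitude": [-79.329656, -79.360636],
    }
)

rows = list(marker_rows(toy, borough_contains="Toronto"))
print(rows)
```

With this in place, `build_map(canada_m)` and `build_map(canada_m, "Toronto", zoom_start=12)` would correspond to the two cells above.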