<h1> Postal Codes in Canada </h1>

<h2> Part 1 </h2>

Firstly, import all the necessary libraries. 

In [89]:
# import all necessary libraries
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

Now I can set the URL and request the HTML text from Wikipedia. Uncommenting the last line of the cell prints the raw HTML to show what it looks like. 


In [90]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

html = requests.get(url).text
#html

We want to parse this mess to find the tables on the page. That's where our data will be. 

In [91]:
# parse the data and find_all of the tables on the website
soup = BeautifulSoup(html, 'html5lib')
tables = soup.find_all('table')
len(tables)

3

It appears there are a few tables on the website, so we'll have to find the correct table index to get our info. 

In [92]:
# since there are 3 tables, find out which table the data is in
for index, table in enumerate(tables):
    if ("wikitable" in str(table)):
        table_index = index
print(table_index)

0
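
As a side note, the substring check above matches any table whose markup mentions "wikitable" anywhere, and the loop keeps the last match. A minimal sketch of a more direct lookup, assuming the data table carries the CSS class <code>wikitable</code>, is shown below. The HTML snippet and the names <code>sample_html</code>, <code>sample_soup</code>, and <code>data_table</code> are hypothetical and only illustrate the call; the built-in <code>html.parser</code> is used here so the sketch does not depend on html5lib.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page: one layout table and one data table.
sample_html = """
<table class="navbox"><tr><td>navigation</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_ matches any table whose class list contains "wikitable"
data_table = sample_soup.find("table", class_="wikitable")

# collect the text of every data cell in the selected table
first_cells = [td.get_text(strip=True) for td in data_table.find_all("td")]
print(first_cells)
```

Selecting by class returns the table itself, so no index bookkeeping is needed. It would return <code>None</code> if no such table exists, which is worth checking before use.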


Now we can get all the info needed and put it into a dataframe. 
First, make an empty list to collect one record per table row. 

Then, find all the <code>tr</code> and <code>td</code> elements within the HTML, strip the surrounding whitespace from each cell, and build the <code>postal_codes</code> dataframe from the collected records. 

In [93]:
# place all of the data into a dataframe
# (rows are collected in a list because DataFrame.append was removed in pandas 2.0)
rows = []

for row in tables[table_index].tbody.find_all("tr"):
    col = row.find_all("td")
    if col:  # rows without any <td> cells (such as the header row) are skipped
        rows.append({
            "PostalCode": col[0].text.strip(),
            "Borough": col[1].text.strip(),
            "Neighborhood": col[2].text.strip(),
        })

postal_codes = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])
postal_codes

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


Keep only the rows that have an assigned Borough, and drop the rows marked "Not assigned" because we can't use them.

In [94]:
# get rid of all the Boroughs that are not assigned
postal_codes = postal_codes[postal_codes.Borough != 'Not assigned']

In [95]:
# reset the index 
postal_codes = postal_codes.reset_index(drop=True)
postal_codes.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [105]:
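# collect the neighborhoods that share a postal code and borough into one list
postal_codes = postal_codes.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(list)

# sample(frac=1) shuffles the rows; reset_index() turns the group keys back into columns
postal_codes = postal_codes.sample(frac=1).reset_index()
# join each list of neighborhoods into a single comma-separated string
postal_codes['Neighborhood'] = postal_codes['Neighborhood'].str.join(',')
postal_codes.head()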

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M9L,North York,Humber Summit
1,M2J,North York,"Fairview, Henry Farm, Oriole"
2,M6P,West Toronto,"High Park, The Junction South"
3,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
4,M1B,Scarborough,"Malvern, Rouge"
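
To make the grouping step easier to follow, here is a minimal sketch on a tiny frame. The three rows and the names <code>sample</code> and <code>combined</code> are hypothetical, chosen so that one postal code appears twice. It uses <code>agg</code> with a string join instead of building lists first, and it leaves out the shuffle, which is not needed for combining the rows.

```python
import pandas as pd

# Hypothetical rows: one postal code listed twice with different neighborhoods.
sample = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# one row per (PostalCode, Borough), neighborhoods joined into one string
combined = (
    sample.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(combined)
```

Because <code>groupby</code> sorts by the group keys by default, the result comes back ordered by postal code, which also keeps the index in step with the row order.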


In [106]:
postal_codes.shape

(103, 3)

That's about it. Now we have some data that we can sort through and actually use. 

<h2> Part 2 </h2>

In [71]:
pip install geocoder

You should consider upgrading via the '/Library/Frameworks/Python.framework/Versions/3.7/bin/python3.7 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [72]:
'''
import geocoder # import geocoder

# initialize your variable to None
lat_lng_coords = None

# loop until you get the coordinates
while(lat_lng_coords is None):
  g = geocoder.google('{}, Toronto, Ontario'.format(postal_codes))
  lat_lng_coords = g.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]

'''


In [107]:
csv = 'Geospatial_Coordinates.csv'
lat_long = pd.read_csv('/Users/jacobgood/Desktop/MOOCs for Money/Coursera/5. Applied Data Science Capstone/Week 3 - k-Means/Final Project/Geospatial_Coordinates.csv')

Let's take a look at the datasets and then merge them together. 

In [108]:
lat_long.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [109]:
postal_codes.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M9L,North York,Humber Summit
1,M2J,North York,"Fairview, Henry Farm, Oriole"
2,M6P,West Toronto,"High Park, The Junction South"
3,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
4,M1B,Scarborough,"Malvern, Rouge"


In [110]:
lat_long.shape, postal_codes.shape

((103, 3), (103, 3))

Great! They're the same shape. This will be easy. Let's make sure the Postal Codes are sorted the same. 

In [111]:
lat_long = lat_long.sort_values(by=['Postal Code'])
lat_long

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [112]:
postal_codes = postal_codes.sort_values(by=['PostalCode'])
postal_codes

Unnamed: 0,PostalCode,Borough,Neighborhood
4,M1B,Scarborough,"Malvern, Rouge"
20,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
92,M1E,Scarborough,"Guildwood, Morningside, West Hill"
70,M1G,Scarborough,Woburn
37,M1H,Scarborough,Cedarbrae
...,...,...,...
87,M9N,York,Weston
22,M9P,Etobicoke,Westmount
54,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
56,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."


Both frames are now sorted by postal code, so the rows line up by position. Their index labels do not, though: postal_codes still carries the shuffled index from the earlier sample step, and pandas aligns a column assignment on index labels, which would pair each postal code with the wrong coordinates. Assigning the underlying arrays copies the values row by row instead. First we'll add the latitudes and then the longitudes. Check the result to see how we did!

In [113]:
# assign by position (.to_numpy()) rather than by index label,
# because postal_codes has a shuffled index after sample(frac=1)
postal_codes['Latitude'] = lat_long['Latitude'].to_numpy()
postal_codes['Longitude'] = lat_long['Longitude'].to_numpy()


In [114]:
postal_codes

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
4,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
20,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
92,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
70,M1G,Scarborough,Woburn,43.770992,-79.216917
37,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
87,M9N,York,Weston,43.706876,-79.518188
22,M9P,Etobicoke,Westmount,43.696319,-79.532242
54,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724
56,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437
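
Copying columns from one frame into another depends on how the rows of the two frames are aligned, so a key-based merge is usually a safer way to attach the coordinates. The sketch below is one way to do it, not the method used in this notebook. The frames <code>postal_codes_demo</code> and <code>lat_long_demo</code> are hypothetical miniatures that reuse two coordinate rows shown earlier; the second frame is deliberately in a different order and the first has a shuffled index.

```python
import pandas as pd

# Hypothetical miniature frames; postal_codes_demo has a shuffled index.
postal_codes_demo = pd.DataFrame(
    {"PostalCode": ["M1B", "M1C"], "Borough": ["Scarborough", "Scarborough"]},
    index=[4, 20],
)
# Coordinates listed in a different order than the postal codes above.
lat_long_demo = pd.DataFrame({
    "Postal Code": ["M1C", "M1B"],
    "Latitude": [43.784535, 43.806686],
    "Longitude": [-79.160497, -79.194353],
})

# Match rows on the postal code itself, not on position or index labels.
merged = postal_codes_demo.merge(
    lat_long_demo.rename(columns={"Postal Code": "PostalCode"}),
    on="PostalCode",
    how="left",             # keep every row of postal_codes_demo
    validate="one_to_one",  # raise if a postal code is duplicated on either side
)
print(merged)
```

With a merge, neither frame needs to be sorted first, and a postal code missing from the coordinate file shows up as NaN in Latitude and Longitude instead of silently shifting the other rows.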
