<h1 align=center><font size = 5>Creating the Final Dataframe of the List of Postal Codes of Canada with Latitude and Longitude Coordinates</font></h1>

This project is part of the Applied Data Science Capstone, which explores and clusters the neighborhoods of Toronto, Canada. First, a dataframe of the postal codes is created. Then the latitude and longitude coordinates of each postal code area are added. The coordinates are taken from Google Maps and are loaded here from a CSV file.

In [48]:
import numpy as np
import pandas as pd

#!conda install -c anaconda beautifulsoup4   #uncomment if installing for the first time

from bs4 import BeautifulSoup

Download the list of postal codes of Canada (those starting with M, which cover Toronto) from the Wikipedia URL.

In [49]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
table = pd.read_html(url)

`pd.read_html` returns a list of dataframes, one per table found on the page. Take the first one, which holds the postal codes.

In [50]:
neighborhoods = table[0]
neighborhoods

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."
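
Picking `table[0]` relies on the postal code table being the first one on the page, which may change if the page is edited. As a hedged sketch, the table could be selected by its column names instead. The helper `pick_table` and the toy dataframes below are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first dataframe whose columns include all required_columns."""
    required = set(required_columns)
    for candidate in tables:
        if required.issubset(candidate.columns):
            return candidate
    raise ValueError(f"No table contains the columns {sorted(required)}")

# Toy stand-in for the list returned by pd.read_html (hypothetical data).
toy_tables = [
    pd.DataFrame({"Navigation": ["links"]}),
    pd.DataFrame({
        "Postal Code": ["M3A"],
        "Borough": ["North York"],
        "Neighbourhood": ["Parkwoods"],
    }),
]

chosen = pick_table(toy_tables, ["Postal Code", "Borough", "Neighbourhood"])
print(chosen)
```

In the notebook this would be called as `pick_table(table, [...])` on the list returned by `pd.read_html`. `pd.read_html` also accepts a `match` argument that keeps only tables containing the given text, which may serve the same purpose.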


We only need to process the cells that have an assigned borough. Hence, remove the rows whose Borough column equals "Not assigned".

In [51]:
neighborhoods2 = neighborhoods[neighborhoods.Borough != 'Not assigned']
neighborhoods2

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


More than one neighborhood can exist in one postal code area. If so, the rows need to be combined into one, with the neighborhoods separated by commas.
So, first check whether any postal code appears more than once.

In [52]:
duplicates = neighborhoods2[neighborhoods2.duplicated(['Postal Code'])]
print(duplicates)

Empty DataFrame
Columns: [Postal Code, Borough, Neighbourhood]
Index: []


As shown in the result above, there are no duplicate postal codes, so no rows need to be combined.
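
Had duplicates been present, the rows could have been combined along the following lines. This is a minimal sketch on a made-up dataframe (the split rows below are hypothetical and do not appear in the Wikipedia table), not a step the notebook actually performs.

```python
import pandas as pd

# Hypothetical input in which one postal code is split across two rows.
toy = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Lawrence Manor"],
})

# Group by postal code and borough, then join the neighbourhood names
# of each group into a single comma-separated string.
combined = (
    toy.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
       .agg(", ".join)
       .reset_index()
)
print(combined)
```

Grouping on both `Postal Code` and `Borough` keeps the borough column in the result; it assumes that a postal code belongs to exactly one borough.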

Now, if a row has a borough but a "Not assigned" neighborhood, the neighborhood is set to the name of the borough. The code below does this.

In [53]:
# Work on an explicit copy so the assignment does not trigger SettingWithCopyWarning.
neighborhoods2 = neighborhoods2.copy()
neighborhoods2.loc[(neighborhoods2.Borough!="Not assigned") & (neighborhoods2.Neighbourhood=="Not assigned"), 'Neighbourhood'] = neighborhoods2.Borough
neighborhoods2

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
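
The table above shows no visible change, so it may help to see the substitution on a case where it actually applies. The sketch below uses `Series.mask` on a made-up dataframe; the rows are hypothetical and only illustrate the rule.

```python
import pandas as pd

# Hypothetical rows: the first has a borough but no assigned neighbourhood.
toy = pd.DataFrame({
    "Postal Code": ["M7A", "M9A"],
    "Borough": ["Queen's Park", "Etobicoke"],
    "Neighbourhood": ["Not assigned", "Islington Avenue"],
})

# Where the neighbourhood is "Not assigned", take the borough name instead.
missing = toy["Neighbourhood"] == "Not assigned"
filled = toy.assign(Neighbourhood=toy["Neighbourhood"].mask(missing, toy["Borough"]))
print(filled)
```

`assign` returns a new dataframe and leaves `toy` unchanged, which sidesteps the view-versus-copy question altogether.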


Lastly, print the number of rows and columns of the final dataframe.

In [54]:
neighborhoods2.shape

(103, 3)

### Next, add the latitude and longitude coordinates of each postal code area.

In [55]:
!wget -q -O 'Geospatial_Coordinates.csv' https://cocl.us/Geospatial_data

In [56]:
lat_lng = pd.read_csv('Geospatial_Coordinates.csv')
lat_lng.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [57]:
lat_lng.shape

(103, 3)

In [58]:
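# Join on the shared 'Postal Code' column, named explicitly rather than inferred.
neighborhoods3 = pd.merge(neighborhoods2, lat_lng, on='Postal Code')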
neighborhoods3.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


In [60]:
neighborhoods3.shape

(103, 5)
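
The matching row count suggests that every postal code found its coordinates. As a hedged sketch, the same thing can be checked explicitly with the `validate` and `indicator` options of `pd.merge`. The dataframes below are toy stand-ins, and the coordinates for M5A are deliberately left out to show what an unmatched row looks like.

```python
import pandas as pd

# Toy stand-ins for neighborhoods2 and lat_lng (M5A has no coordinates on purpose).
left = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})
right = pd.DataFrame({
    "Postal Code": ["M4A", "M3A"],
    "Latitude": [43.725882, 43.753259],
    "Longitude": [-79.315572, -79.329656],
})

# how="left" keeps every postal code; validate raises if a key is duplicated
# on either side; indicator adds a "_merge" column recording where each row matched.
checked = pd.merge(left, right, on="Postal Code", how="left",
                   validate="one_to_one", indicator=True)

# Postal codes that found no coordinates.
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that no postal code is missing its coordinates.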