In [None]:
import pandas as pd
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

df=pd.read_html(url, header=0)[0]

df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Some rows have 'Not assigned' in the Borough column. We can remove them simply by creating a new data frame, as shown below. The following code keeps only the rows whose Borough value is not 'Not assigned'.

In [None]:
n_df=df[df.Borough != 'Not assigned']


The following code cell double-checks whether any cell still contains 'Not assigned'.



In [None]:
for column in n_df.columns:
  print('\n'+column)
  print(n_df[column].value_counts())


Postal Code
M6H    1
M6A    1
M4E    1
M1R    1
M5V    1
      ..
M6G    1
M1L    1
M9N    1
M2P    1
M1C    1
Name: Postal Code, Length: 103, dtype: int64

Borough
North York          24
Downtown Toronto    19
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
York                 5
East Toronto         5
East York            5
Mississauga          1
Name: Borough, dtype: int64

Neighbourhood
Downsview                                      4
Don Mills                                      2
Thorncliffe Park                               1
Upper Rouge                                    1
Wexford, Maryvale                              1
                                              ..
Moore Park, Summerhill East                    1
Queen's Park, Ontario Provincial Government    1
Woodbine Heights                               1
Golden Mile, Clairlea, Oakridge                1
Caledonia-Fairbanks                            1
Name: Neighbourhood, L
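Scanning the value counts by eye works, but the same check can be made explicit by counting the cells that still equal the placeholder. The snippet below is a minimal sketch on a made-up three-row table (the `toy` frame and its rows are illustrative, not the scraped data).

```python
import pandas as pd

# Toy stand-in for the scraped table; the rows are made up for illustration.
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Same boolean-mask filter as in the notebook.
toy_clean = toy[toy["Borough"] != "Not assigned"]

# Count, per column, the cells that still equal the placeholder.
leftover = (toy_clean == "Not assigned").sum()
print(leftover)
```

If every entry of `leftover` is zero, no placeholder survived the filter.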

The following cell counts the 'Not assigned' values in the original data frame to see whether the count is the same in Borough and Neighbourhood.

In [None]:
check=['Borough','Neighbourhood']
C=df[check]
for column in C.columns:
  print('\n'+column)
  print(C[column].value_counts())


Borough
Not assigned        77
North York          24
Downtown Toronto    19
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
East Toronto         5
York                 5
East York            5
Mississauga          1
Name: Borough, dtype: int64

Neighbourhood
Not assigned                      77
Downsview                          4
Don Mills                          2
High Park, The Junction South      1
Commerce Court, Victoria Hotel     1
                                  ..
Regent Park, Harbourfront          1
Cedarbrae                          1
Humber Summit                      1
Wexford, Maryvale                  1
Bayview Village                    1
Name: Neighbourhood, Length: 100, dtype: int64
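Equal counts do not by themselves prove that the placeholder sits in the same rows of both columns. A row-wise comparison of the two masks settles that. The sketch below uses a small invented frame (`toy` is hypothetical) to show the idea.

```python
import pandas as pd

# Invented rows, used only to illustrate the comparison.
toy = pd.DataFrame({
    "Borough": ["Not assigned", "North York", "Not assigned", "Scarborough"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned", "Cedarbrae"],
})

# One boolean mask per column: True where the placeholder appears.
borough_missing = toy["Borough"] == "Not assigned"
neighbourhood_missing = toy["Neighbourhood"] == "Not assigned"

# Compare the counts, then check that the masks agree row by row.
counts = (int(borough_missing.sum()), int(neighbourhood_missing.sum()))
same_rows = bool((borough_missing == neighbourhood_missing).all())
print(counts, same_rows)
```

When `same_rows` is true, filtering on Borough alone also removes every unassigned Neighbourhood.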


Next, we find the shapes of both data frames and compute how many rows were removed by filtering out the 'Not assigned' values.

In [None]:
print(df.shape)
print(n_df.shape)
rows_removed=df.shape[0]-n_df.shape[0]
print('number of rows removed: ', rows_removed)


(180, 3)
(103, 3)
number of rows removed:  77


This part retrieves the latitude and longitude coordinates using a geocoder loop.

In [None]:
!pip install geocoder

Collecting geocoder
  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
Collecting ratelim
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad4
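A geocoder loop typically repeats the lookup for a postal code until coordinates come back. The sketch below is one possible shape for such a loop, with a cap on the number of attempts so it cannot hang. The helper `lookup_with_retries` and the stand-in `fake_lookup` are hypothetical names; no real geocoding service is called here, and the `geocoder.arcgis` call mentioned in the comment is an assumption about how a real lookup might be wired in.

```python
import time

def lookup_with_retries(lookup, query, max_tries=5, pause=0.0):
    """Call lookup(query) until it returns coordinates or tries run out."""
    for _ in range(max_tries):
        coords = lookup(query)
        if coords:  # None or an empty list counts as a failed lookup
            return tuple(coords)
        time.sleep(pause)
    return None

# Hypothetical stand-in for a real geocoding call, which might look like
# lambda q: geocoder.arcgis(q).latlng (an assumption, not run here).
_calls = {"n": 0}

def fake_lookup(query):
    # Fails twice, then returns the M1B coordinates from the CSV file.
    _calls["n"] += 1
    return [43.806686, -79.194353] if _calls["n"] >= 3 else None

print(lookup_with_retries(fake_lookup, "M1B, Toronto, Ontario"))
```

Capping the attempts means a failed lookup returns `None` for later inspection instead of blocking the notebook.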

The geocoder was taking too long and did not work reliably. To make things more efficient and accurate, we took the geospatial coordinates file and uploaded it as shown below.

In [None]:
from google.colab import files
uploaded=files.upload()


Saving Geospatial_Coordinates.csv to Geospatial_Coordinates.csv


In [None]:
import io
geo_data=pd.read_csv(io.BytesIO(uploaded['Geospatial_Coordinates.csv']))
geo_data.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [None]:
geo_data.shape

(103, 3)

Once the file has been uploaded, it's important to compare its shape with that of n_df to make sure both have the same number of entries. Once that is verified, we can merge the two files.

In [None]:
n_df.shape

(103, 3)
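Matching row counts are a necessary check, but two tables can have the same length and still hold different postal codes. Comparing the key sets directly is a stronger test. The sketch below uses two tiny invented frames (`left` and `right` are hypothetical names).

```python
import pandas as pd

# Two tiny frames with the same codes in a different order.
left = pd.DataFrame({"Postal Code": ["M3A", "M4A", "M5A"]})
right = pd.DataFrame({"Postal Code": ["M5A", "M3A", "M4A"]})

left_codes = set(left["Postal Code"])
right_codes = set(right["Postal Code"])

# Codes present on one side only; both sets should be empty.
only_left = left_codes - right_codes
only_right = right_codes - left_codes
print(sorted(only_left), sorted(only_right))
```

Two empty sets mean every postal code has a partner in the other table.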

In [None]:
g_df=n_df.merge(geo_data, on='Postal Code')
g_df.head()


Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


Once the file has been merged, let's check the shape to verify it contains the same number of rows and two more columns.

In [None]:
g_df.shape

(103, 5)
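The shape check can also be built into the merge itself. As a sketch, assuming each postal code appears once in both tables, pandas can enforce that with `validate` and flag unmatched rows with `indicator`. The frames `boroughs` and `coords` below are small illustrative copies of the first two merged rows, not the full data.

```python
import pandas as pd

boroughs = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# validate raises MergeError if a postal code repeats on either side;
# indicator adds a _merge column showing where each row came from.
merged = boroughs.merge(coords, on="Postal Code", how="left",
                        validate="one_to_one", indicator=True)

# Rows without a partner in coords would not be marked "both".
unmatched = merged[merged["_merge"] != "both"]
print(merged.shape, len(unmatched))
```

A left join keeps every row of the first table, so a missing coordinate shows up as an unmatched row instead of silently disappearing.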

Next, let's import json, Nominatim, folium, and the other libraries needed to make a map.