# 1. Web scraping

Import the libraries needed for the web scraping process.

In [None]:
import requests
from bs4 import BeautifulSoup

In [None]:
# Official link provided in Assignment page
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

# The page layout has since changed, so we use an older revision of the link above
url = "https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=1011037969"

Web scraping process using [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) and [requests](https://requests.readthedocs.io/)

In [None]:
canada_postal_codes = "PostalCode;Borough;Neighborhood\n" # Columns for our csv file
filename = "canada_postal_codes.csv"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
table = soup.find('table', {'class': 'wikitable sortable'}) # Find the table whose class attribute is 'wikitable sortable'
for row in table.find_all('tr')[1:]:  # iterate over each table row
  code, borough, neighbor = [col.text.strip() for col in row.find_all('td')]  # extract table values
  if borough != "Not assigned":
    if neighbor == "Not assigned":
      neighbor = borough
    canada_postal_codes += f'{code};{borough};{neighbor}\n' # insert values into the table
with open(filename, "w") as f:  # create a csv file and insert records to the file
  f.write(canada_postal_codes)
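
The string concatenation above works because none of the scraped fields contains a semicolon. As a hedged alternative, the standard-library `csv` module can handle delimiters and quoting on its own. The sketch below is only an illustration of the same cleaning rules; `clean_rows` and `write_rows` are hypothetical helper names, and the second and third sample rows are made-up values shaped like the table cells.

```python
import csv
import io

NOT_ASSIGNED = "Not assigned"

def clean_rows(rows):
    """Yield (code, borough, neighborhood) tuples, applying the notebook's rules."""
    for code, borough, neighborhood in rows:
        if borough == NOT_ASSIGNED:
            continue  # skip postal codes that have no borough
        if neighborhood == NOT_ASSIGNED:
            neighborhood = borough  # fall back to the borough name
        yield code, borough, neighborhood

def write_rows(rows, fileobj):
    """Write a header plus the cleaned rows as semicolon-delimited CSV."""
    writer = csv.writer(fileobj, delimiter=";", lineterminator="\n")
    writer.writerow(["PostalCode", "Borough", "Neighborhood"])
    writer.writerows(clean_rows(rows))

# Illustrative rows only; "M0X" and "M0Y" are hypothetical codes for the demo.
sample = [
    ("M3A", "North York", "Parkwoods"),
    ("M0X", "Not assigned", "Not assigned"),
    ("M0Y", "Example Borough", "Not assigned"),
]
buffer = io.StringIO()
write_rows(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing an open file instead of the `io.StringIO` buffer would write the same text to disk.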

Import the data manipulation libraries.

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv(filename, sep=';')  # pass the separator by keyword; newer pandas versions reject it positionally
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [None]:
len(df.PostalCode.unique())

103

In [None]:
df.shape

(103, 3)

The number of unique postal codes equals the number of rows, so each record has a unique postal code.
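
Comparing the unique count with the row count confirms uniqueness, but it does not say which codes repeat when the check fails. A minimal plain-Python sketch of that follow-up is shown below; `duplicated_values` is a hypothetical helper, and the list holds only the first five codes from the output above.

```python
from collections import Counter

def duplicated_values(values):
    """Return a dict mapping each repeated value to its number of occurrences."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n > 1}

codes = ["M3A", "M4A", "M5A", "M6A", "M7A"]  # first five codes from the table above
dupes = duplicated_values(codes)
print(dupes)  # an empty dict means every code is unique
```

In pandas, `df.PostalCode.is_unique` expresses the same yes/no check in one attribute.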

# 2. Geographical coordinates

Download the provided CSV file, which contains the geographical coordinates of each postal code.

In [None]:
geospatial_df = pd.read_csv("https://cocl.us/Geospatial_data")
geospatial_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


We will also use the [Geocoder](https://geocoder.readthedocs.io/index.html) package to fetch geographical coordinates.

In [None]:
!pip install geocoder -q
import geocoder

The Geocoder package supports many [geocode providers](https://geocoder.readthedocs.io/index.html#providers), but most of them have usage limits or require API keys or registration. The free providers are listed below.
- [ArcGIS](https://geocoder.readthedocs.io/providers/ArcGIS.html)
- [GeocodeFarm](https://geocoder.readthedocs.io/providers/GeocodeFarm.html)
- Komoot
- Location
- OSM

We will query all of them and compare their results with the geographical coordinates from the CSV file downloaded earlier.

In [None]:
data = []
i = 1
for code in df.PostalCode:
  print('\rProcessing:', i, end="")
  query = '{}, Toronto, Ontario'.format(code)
  row = [code]
  coders = [geocoder.arcgis, geocoder.geocodefarm, geocoder.komoot, geocoder.location, geocoder.osm]
  for c in coders:
    try:
      values = c(query).latlng
    except Exception:  # treat any provider failure as a missing result
      values = [np.nan, np.nan]
    if values is None or len(values) != 2:
      values = [np.nan, np.nan]
    row.extend(values)
  row += list(geospatial_df[geospatial_df['Postal Code'] == code].to_numpy()[0])[1:]
  data.append(row)
  i += 1

Processing: 103
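
The per-provider handling in the loop (catch failures, then replace missing or malformed results with a NaN pair) can be isolated into one function. The following is a sketch under assumptions, not the notebook's code: `safe_latlng` is a hypothetical helper, and the three stand-in providers only imitate objects that expose a `latlng` attribute, so no network access is needed.

```python
import math

def safe_latlng(provider, query):
    """Call provider(query) and always return a [lat, lng] list (NaN pair on failure)."""
    try:
        values = provider(query).latlng
    except Exception:
        return [math.nan, math.nan]
    if values is None or len(values) != 2:
        return [math.nan, math.nan]
    return list(values)

class _Result:
    """Stand-in for a geocoding result object with a latlng attribute."""
    def __init__(self, latlng):
        self.latlng = latlng

def good_provider(query):
    return _Result([43.75245, -79.32991])

def empty_provider(query):
    return _Result(None)  # provider answered but found nothing

def failing_provider(query):
    raise ConnectionError("simulated network failure")

row = ["M3A"]
for provider in (good_provider, empty_provider, failing_provider):
    row.extend(safe_latlng(provider, "M3A, Toronto, Ontario"))
print(row)
```

Keeping the row length fixed this way is what lets every row line up with the column layout used for the DataFrame.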

Using the data above, we can build a DataFrame. The last two columns, labelled `google`, hold the coordinates from the provided CSV file.

In [None]:
geocoder_df = pd.DataFrame(data, columns=pd.MultiIndex.from_tuples((('PostalCode', ''),
                                                                ("arcgis", "lat"), ("arcgis", "lng"),
                                                                ("geocodefarm", "lat"), ("geocodefarm", "lng"),
                                                                ("komoot", "lat"), ("komoot", "lng"),
                                                                ("location", "lat"), ("location", "lng"),
                                                                ("osm", "lat"), ("osm", "lng"),
                                                                ("google", "lat"), ("google", "lng"))))
geocoder_df.head()

Unnamed: 0_level_0,PostalCode,arcgis,arcgis,geocodefarm,geocodefarm,komoot,komoot,location,location,osm,osm,google,google
Unnamed: 0_level_1,Unnamed: 1_level_1,lat,lng,lat,lng,lat,lng,lat,lng,lat,lng,lat,lng
0,M3A,43.75245,-79.32991,43.756123,-79.329636,43.740375,-79.321746,43.652384,-79.383568,43.652384,-79.383568,43.753259,-79.329656
1,M4A,43.73057,-79.31306,43.72678,-79.310738,43.732658,-79.311189,,,,,43.725882,-79.315572
2,M5A,43.65512,-79.36264,43.655354,-79.365044,43.6514,-79.365837,,,,,43.65426,-79.360636
3,M6A,43.72327,-79.45042,43.721996,-79.445915,43.715283,-79.443914,,,,,43.718518,-79.464763
4,M7A,43.66253,-79.39188,43.66391,-79.388733,43.775062,-79.500381,43.652384,-79.383568,43.652384,-79.383568,43.662301,-79.389494


In [33]:
geocoder_df.isna().sum()

PostalCode           0
arcgis       lat     0
             lng     0
geocodefarm  lat     0
             lng     0
komoot       lat     0
             lng     0
location     lat    83
             lng    83
osm          lat    83
             lng    83
google       lat     0
             lng     0
dtype: int64

There are 83 null entries each for the location and osm providers. Hence, these providers are unreliable sources of geographical coordinates for this data.

We can use the root mean squared error (RMSE) against the provided coordinates to compare the quality of the remaining geocode providers:

In [None]:
from sklearn.metrics import mean_squared_error

In [35]:
mean_squared_error(geocoder_df.google, geocoder_df.arcgis, squared=False)

0.017241480466064354

In [41]:
mean_squared_error(geocoder_df.google, geocoder_df.komoot, squared=False)

0.030215889135091226

In [40]:
mean_squared_error(geocoder_df.google, geocoder_df.geocodefarm, squared=False)

0.01739045143128333
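
For reference, the metric itself can be sketched in plain Python. The example below computes one RMSE per coordinate column and then averages the two, which is one common convention; whether scikit-learn's two-column call combines the columns in exactly this way may depend on the installed version, so treat this as an illustration of the idea rather than a reproduction of the numbers above. `rmse` is a hypothetical helper, and the values are the first five rows of the comparison table.

```python
import math

def rmse(reference, estimate):
    """Root mean squared error between two equal-length sequences of numbers."""
    if len(reference) != len(estimate):
        raise ValueError("sequences must have the same length")
    squared = [(r - e) ** 2 for r, e in zip(reference, estimate)]
    return math.sqrt(sum(squared) / len(squared))

# First five rows of the comparison table: reference ("google") vs. ArcGIS
ref_lat = [43.753259, 43.725882, 43.65426, 43.718518, 43.662301]
ref_lng = [-79.329656, -79.315572, -79.360636, -79.464763, -79.389494]
arcgis_lat = [43.75245, 43.73057, 43.65512, 43.72327, 43.66253]
arcgis_lng = [-79.32991, -79.31306, -79.36264, -79.45042, -79.39188]

lat_rmse = rmse(ref_lat, arcgis_lat)
lng_rmse = rmse(ref_lng, arcgis_lng)
mean_rmse = (lat_rmse + lng_rmse) / 2  # average of the per-column errors, in degrees
print(lat_rmse, lng_rmse, mean_rmse)
```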

ArcGIS and GeocodeFarm have the lowest RMSE (about 0.017 each, versus about 0.030 for Komoot), so they are the most reliable of the free geocode providers tested here.
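
The RMSE values are expressed in degrees, which are hard to interpret as distances. As a rough aid, the haversine formula converts a pair of coordinates into a great-circle distance. This is a sketch, assuming a spherical Earth with a mean radius of 6371 km; `haversine_km` is a hypothetical helper, and the sample call uses the M3A row from the comparison table.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two (lat, lng) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# M3A: ArcGIS result vs. the provided reference coordinates
distance = haversine_km(43.75245, -79.32991, 43.753259, -79.329656)
print(f"{distance:.3f} km")
```

Applying the same function to every row would give a per-postal-code error in kilometres instead of a single figure in degrees.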

Let's average the coordinates from ArcGIS and GeocodeFarm and see whether the result improves.

In [43]:
geocoder_df_concat = pd.concat((geocoder_df.geocodefarm, geocoder_df.arcgis)).groupby(level=0)  # group the stacked rows by their shared index label
mean_squared_error(geocoder_df.google, geocoder_df_concat.mean(), squared=False)

0.017235856388921184

The averaged coordinates give only a marginally lower RMSE (0.017236 versus 0.017241 for ArcGIS alone), so we can stick with the ArcGIS geocode provider for now. Let's use the geographical coordinates provided by ArcGIS to create the new columns in `df`, as mentioned in the task.

In [47]:
df[['Latitude', 'Longitude']] = geocoder_df['arcgis']
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.75245,-79.32991
1,M4A,North York,Victoria Village,43.73057,-79.31306
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.72327,-79.45042
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66253,-79.39188


The new columns, Latitude and Longitude, have been created in `df`.