# <center>Applied Data Science Capstone</center>

#### <center>In completion of requirements for the IBM Data Science Professional Certificate on Coursera</center>

<hr>

This file will be used to implement a capstone data science project using location data from Foursquare.

Watch it grow on [GitHub](https://github.com/Arkadiatri/Coursera_Capstone/)!

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup # had to install to environment in Anaconda
import lxml # had to install to environment in Anaconda, backdated to 4.6.1 (4.6.2 current) for pandas read_html()
import html5lib # had to install to environment in Anaconda (1.1 current) for pandas read_html()

In [2]:
print('Hello Capstone Project Course!')

Hello Capstone Project Course!


## Table of Toronto postal codes

First we will use the Requests module to get the [webpage](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) containing the data we need.

In [3]:
import requests

In [4]:
url_wikipedia_postal_codes = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [5]:
webpage = requests.get(url_wikipedia_postal_codes)

We could inspect webpage.text and extract the tables with BeautifulSoup, but pandas can parse the tables from the webpage text directly with read_html().

Let's inspect the tables automatically parsed from the page:

In [6]:
df = pd.read_html(webpage.text)
for i, d in enumerate(df):
    print(f'Table {i}:')
    display(d.head())

Table 0:


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Table 1:


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,,Canadian postal codes,,,,,,,,,,,,,,,,
1,NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...,NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...,NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...,,,,,,,,,,,,,,,
2,NL,NS,PE,NB,QC,QC,QC,ON,ON,ON,ON,ON,MB,SK,AB,BC,NU/NT,YT
3,A,B,C,E,G,H,J,K,L,M,N,P,R,S,T,V,X,Y


Table 2:


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,NL,NS,PE,NB,QC,QC,QC,ON,ON,ON,ON,ON,MB,SK,AB,BC,NU/NT,YT
1,A,B,C,E,G,H,J,K,L,M,N,P,R,S,T,V,X,Y


By inspection, the table we want is at index 0.

In [7]:
df = df[0].astype(str)
df.shape

(180, 3)

Let's clean up the columns.

First, the Postal Code is expected to be unique:

In [8]:
len(df['Postal Code'].unique()) == len(df['Postal Code'])

True

So rows are uniquely indexed by the Postal Code, as desired. We also do not have to combine neighborhoods into a comma-separated list as the assignment instructions ask, because the table already lists them that way.
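
The same check can also be written with the pandas `Series.is_unique` property, and `duplicated` can list the offending codes if the check ever fails. A minimal sketch on a small made-up frame (the `toy` data is illustrative only, not taken from the Wikipedia table):

```python
import pandas as pd

# Illustrative data only: one postal code is deliberately repeated.
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M2A', 'M2A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
})

# True only when every value in the column occurs exactly once.
codes_unique = toy['Postal Code'].is_unique

# Rows whose postal code occurs more than once (keep=False marks all copies).
repeated = toy[toy['Postal Code'].duplicated(keep=False)]

print(codes_unique)
print(repeated)
```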

Second, we keep only the rows whose Borough is not 'Not assigned':

In [9]:
df = df[df['Borough']!='Not assigned'].reset_index(drop=True)
display(df.head())
df.shape

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


(103, 3)

So 77 Postal Codes were not assigned to a Borough.

We should also ensure that all Borough names are legitimate.  This must be done manually.

In [10]:
df['Borough'].unique()

array(['North York', 'Downtown Toronto', 'Etobicoke', 'Scarborough',
       'East York', 'York', 'East Toronto', 'West Toronto',
       'Central Toronto', 'Mississauga'], dtype=object)

Those all look like Borough names, and there are 10 total.

Third, we want each entry in Neighbourhood that is 'Not assigned' to be replaced by the Borough name.

In [11]:
def showalldf(df):
    # Temporarily lift the row display limit so the whole object is shown
    old_opt = pd.get_option('display.max_rows')
    pd.set_option('display.max_rows', len(df.index))
    display(df)
    pd.set_option('display.max_rows', old_opt)
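
An alternative to saving and restoring the option by hand is `pd.option_context`, which restores the previous setting even if displaying raises an error. A sketch, assuming plain `print` in place of the notebook's `display` and a hypothetical function name `show_all`:

```python
import pandas as pd

def show_all(obj):
    """Print a DataFrame or Series without row truncation (sketch)."""
    # Inside the block, max_rows=None means "no limit"; the previous
    # value is restored automatically when the block exits.
    with pd.option_context('display.max_rows', None):
        print(obj)

before = pd.get_option('display.max_rows')
show_all(pd.Series(range(100)))
after = pd.get_option('display.max_rows')
```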

In [57]:
print(f"There are {(df['Neighbourhood']=='Not assigned').sum()} Neighborhoods not assigned")
print('Neighborhoods are:')
showalldf(df['Neighbourhood'])

There are 0 Neighborhoods not assigned
Neighborhoods are:


0                                              Parkwoods
1                                       Victoria Village
2                              Regent Park, Harbourfront
3                       Lawrence Manor, Lawrence Heights
4            Queen's Park, Ontario Provincial Government
5                Islington Avenue, Humber Valley Village
6                                         Malvern, Rouge
7                                              Don Mills
8                        Parkview Hill, Woodbine Gardens
9                               Garden District, Ryerson
10                                             Glencairn
11     West Deane Park, Princess Gardens, Martin Grov...
12                Rouge Hill, Port Union, Highland Creek
13                                             Don Mills
14                                      Woodbine Heights
15                                        St. James Town
16                                    Humewood-Cedarvale
17     Eringate, Bloordale Gard

So all Neighbourhood entries look non-empty.
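
Had any Neighbourhood been 'Not assigned', the replacement described above could be done without a loop. A minimal sketch on made-up rows (the `toy` frame and its borough name are placeholders, not real data):

```python
import pandas as pd

# Placeholder rows: the second one has no neighbourhood assigned.
toy = pd.DataFrame({
    'Borough': ['North York', 'Example Borough'],
    'Neighbourhood': ['Parkwoods', 'Not assigned'],
})

# Keep the Neighbourhood where it is assigned; otherwise take the Borough.
assigned = toy['Neighbourhood'] != 'Not assigned'
toy['Neighbourhood'] = toy['Neighbourhood'].where(assigned, toy['Borough'])
print(toy)
```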

But on closer inspection there are entries that don't make sense for a neighborhood analysis, so we should drop them:
* Row 76: Canada Post Gateway Processing Centre
* Row 92: Stn A PO Boxes
* Row 100: Business reply mail Processing Centre, South Central Letter Processing Plant Toronto

In [60]:
df_cleaned = df.drop(index=[76, 92, 100]).reset_index(drop=True)
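
Dropping by hard-coded row number depends on the table not changing between runs. One possible alternative is to filter on the text itself. The sketch below assumes that the keywords in `pattern` (my choice, not part of the assignment) are enough to identify mail-handling facilities, and it runs on a small stand-in frame:

```python
import pandas as pd

# Stand-in frame: one real neighbourhood plus the three entries listed above.
toy = pd.DataFrame({'Neighbourhood': [
    'Parkwoods',
    'Canada Post Gateway Processing Centre',
    'Stn A PO Boxes',
    'Business reply mail Processing Centre, South Central Letter Processing Plant Toronto',
]})

# Assumed keywords that mark facilities rather than neighbourhoods.
pattern = 'Processing Centre|PO Boxes'
is_facility = toy['Neighbourhood'].str.contains(pattern, regex=True)
toy_cleaned = toy[~is_facility].reset_index(drop=True)
print(toy_cleaned)
```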

We display the final dataframe here:

In [61]:
showalldf(df_cleaned)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


And finally, the postal codes dataframe shape is:

In [62]:
df_cleaned.shape

(100, 3)

## Latitude and Longitude for each Toronto Postal Code

After looking through many geopy and geocoder options, I chose to use Nominatim. While Nominatim doesn't succeed for partial postal code lookup, it did succeed with many neighborhoods. Several missed assignments were due to slightly different names, but there were still over 80 imperfectly assigned neighborhood locations.

I then tried using a commercial service.  Google appears to charge for any developer use, but HERE has a freemium service which I did sign up for.  With an API key I was able to get geopy to geocode a partial postal code, so I'll be continuing work there.  Note that the geocoder module does not seem to work, possibly due to a misconfigured web address for requests (based on the errors returned).

To keep API keys secret, they are stored in a config.py file that is listed in .gitignore, so it is never pushed to GitHub.

To reload the configuration file should it need to be changed in development, run:

    import importlib
    importlib.reload(config)
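
For reference, a config.py of this kind can be as small as a few assignments. The sketch below is an assumption about its layout, not the actual file: it reads the values from environment variables (the variable names and the default user agent are hypothetical) so that no secret is written in the source at all.

```python
# config.py (sketch): keep secrets out of version control.
import os

# HERE REST API key; falls back to an empty string when the variable is unset.
HERE_APIKEY = os.environ.get('HERE_APIKEY', '')

# User agent string sent to Nominatim; the default is a placeholder.
NM_AGENT = os.environ.get('NM_AGENT', 'capstone-notebook')
```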

For HERE, we need to supply just the REST API key (called the app code on their site).

In [81]:
import config

In [93]:
from geopy.geocoders import Here
gc = Here(apikey=config.HERE_APIKEY)

Now we can try to geocode the postal codes

In [143]:
df_c = df_cleaned.copy(deep=True)
for i, row in df_c.iterrows():
    g = gc.geocode(f"{row['Postal Code']}, Ontario")
    if g is not None:
        df_c.loc[i,'GString'] = g[0]
        df_c.loc[i,'Latitude'] = g[1][0]
        df_c.loc[i,'Longitude'] = g[1][1]
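
The loop above mixes the lookup with the DataFrame bookkeeping. A sketch of one way to separate the two follows, so the logic can be tried offline. The function name `add_coordinates` and the stand-in `fake_geocode` (with rounded placeholder coordinates) are hypothetical; the only assumption about the real geocoder is that its result can be indexed as `(address, (latitude, longitude))`, as in the cell above.

```python
import pandas as pd

def add_coordinates(frame, geocode_fn):
    """Return a copy of frame with GString, Latitude and Longitude columns.

    geocode_fn takes a query string and returns either None or something
    indexable as (address, (latitude, longitude)).
    """
    records = []
    for code in frame['Postal Code']:
        hit = geocode_fn(f'{code}, Ontario')
        if hit is None:
            # Failed lookup: leave explicit missing values.
            records.append((None, float('nan'), float('nan')))
        else:
            address, (lat, lon) = hit[0], hit[1]
            records.append((address, lat, lon))
    coords = pd.DataFrame(records, index=frame.index,
                          columns=['GString', 'Latitude', 'Longitude'])
    return pd.concat([frame, coords], axis=1)

# Stand-in geocoder with made-up values so the sketch runs without a key.
def fake_geocode(query):
    table = {'M3A, Ontario': ('M3A, Toronto, ON', (43.75, -79.33))}
    return table.get(query)

toy = pd.DataFrame({'Postal Code': ['M3A', 'M0Z']})
result = add_coordinates(toy, fake_geocode)
print(result)
```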

Inspecting for missing or invalid values:

In [144]:
# Comparing a column with == None is False for every row, so count missing values with isna()
print(f"There are {df_c['GString'].isna().sum()} rows where geocoding failed")
df_tmp = df_c['GString'].copy()
for i, val in enumerate(df_tmp):
    df_tmp[i] = val.find(df_c.loc[i,'Postal Code'])
print(f"There are {sum(df_tmp==-1)} rows where the postal code is not found in the geocoded Location")
df_c.head()

There are 0 rows where geocoding failed
There are 0 rows where the postal code is not found in the geocoded Location


Unnamed: 0,Postal Code,Borough,Neighbourhood,GString,Latitude,Longitude
0,M3A,North York,Parkwoods,"M3A, Toronto, ON, Canada, Toronto, ON M3A, CAN",43.75245,-79.32991
1,M4A,North York,Victoria Village,"M4A, Toronto, ON, Canada, Toronto, ON M4A, CAN",43.73057,-79.31306
2,M5A,Downtown Toronto,"Regent Park, Harbourfront","M5A, Toronto, ON, Canada, Toronto, ON M5A, CAN",43.65512,-79.36264
3,M6A,North York,"Lawrence Manor, Lawrence Heights","M6A, Toronto, ON, Canada, Toronto, ON M6A, CAN",43.72327,-79.45042
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government","M7A, Toronto, ON, Canada, Toronto, ON M7A, CAN",43.66253,-79.39188


Wow, that was so much easier than trying to massage the Nominatim data.

Let's load up the csv file of locations and see if it matches.

In [145]:
df_gt = pd.read_csv('Geospatial_Coordinates.csv')
df_gt.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [146]:
df_gt.shape

(103, 3)

In [147]:
df_gt.rename(columns={'Latitude':'GT Latitude','Longitude':'GT Longitude'}, inplace=True)
df_gt.head()

Unnamed: 0,Postal Code,GT Latitude,GT Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [148]:
df_d = df_c.merge(df_gt, on='Postal Code')
df_d

Unnamed: 0,Postal Code,Borough,Neighbourhood,GString,Latitude,Longitude,GT Latitude,GT Longitude
0,M3A,North York,Parkwoods,"M3A, Toronto, ON, Canada, Toronto, ON M3A, CAN",43.75245,-79.32991,43.753259,-79.329656
1,M4A,North York,Victoria Village,"M4A, Toronto, ON, Canada, Toronto, ON M4A, CAN",43.73057,-79.31306,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront","M5A, Toronto, ON, Canada, Toronto, ON M5A, CAN",43.65512,-79.36264,43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights","M6A, Toronto, ON, Canada, Toronto, ON M6A, CAN",43.72327,-79.45042,43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government","M7A, Toronto, ON, Canada, Toronto, ON M7A, CAN",43.66253,-79.39188,43.662301,-79.389494
...,...,...,...,...,...,...,...,...
95,M5X,Downtown Toronto,"First Canadian Place, Underground city","M5X, Toronto, ON, Canada, Toronto, ON M5X, CAN",43.64828,-79.38146,43.648429,-79.382280
96,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North","M8X, Toronto, ON, Canada, Toronto, ON M8X, CAN",43.65319,-79.51113,43.653654,-79.506944
97,M4Y,Downtown Toronto,Church and Wellesley,"M4Y, Toronto, ON, Canada, Toronto, ON M4Y, CAN",43.66659,-79.38133,43.665860,-79.383160
98,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...","M8Y, Toronto, ON, Canada, Toronto, ON M8Y, CAN",43.63278,-79.48945,43.636258,-79.498509
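
Note that `merge` performs an inner join by default, so a postal code missing from either table drops out of the result without warning. A small sketch of how unmatched codes could be listed instead, using made-up frames (`left`, `right` and the code 'M0Z' are placeholders):

```python
import pandas as pd

left = pd.DataFrame({'Postal Code': ['M3A', 'M4A', 'M0Z'],
                     'Latitude': [43.75, 43.73, 0.0]})
right = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                      'GT Latitude': [43.753, 43.726]})

# how='left' keeps every row of `left`; indicator=True adds a `_merge`
# column saying whether a match was found in `right`.
merged = left.merge(right, on='Postal Code', how='left', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```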


In [149]:
df_d['Diff Lat'] = df_d['Latitude']-df_d['GT Latitude']
df_d['Diff Lon'] = df_d['Longitude']-df_d['GT Longitude']
df_d.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,GString,Latitude,Longitude,GT Latitude,GT Longitude,Diff Lat,Diff Lon
0,M3A,North York,Parkwoods,"M3A, Toronto, ON, Canada, Toronto, ON M3A, CAN",43.75245,-79.32991,43.753259,-79.329656,-0.000809,-0.000253
1,M4A,North York,Victoria Village,"M4A, Toronto, ON, Canada, Toronto, ON M4A, CAN",43.73057,-79.31306,43.725882,-79.315572,0.004688,0.002512
2,M5A,Downtown Toronto,"Regent Park, Harbourfront","M5A, Toronto, ON, Canada, Toronto, ON M5A, CAN",43.65512,-79.36264,43.65426,-79.360636,0.00086,-0.002004
3,M6A,North York,"Lawrence Manor, Lawrence Heights","M6A, Toronto, ON, Canada, Toronto, ON M6A, CAN",43.72327,-79.45042,43.718518,-79.464763,0.004752,0.014343
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government","M7A, Toronto, ON, Canada, Toronto, ON M7A, CAN",43.66253,-79.39188,43.662301,-79.389494,0.000228,-0.002386


In [167]:
print(f"Latitude difference:  {df_d['Diff Lat'].abs().mean():0.5f} +/- {df_d['Diff Lat'].abs().std():0.5f} with maximum {df_d['Diff Lat'].abs().max():0.5f}")
print(f"Longitude difference: {df_d['Diff Lon'].abs().mean():0.5f} +/- {df_d['Diff Lon'].abs().std():0.5f} with maximum {df_d['Diff Lon'].abs().max():0.5f}")

Latitude difference:  0.00253 +/- 0.00261 with maximum 0.01830
Longitude difference: 0.00352 +/- 0.00330 with maximum 0.01464


What do these differences mean in terms of distance?
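
As a rough check before the precise calculation, degrees can be converted to meters with the haversine formula. This is a sketch under a spherical-Earth assumption (mean radius 6371 km), so it will differ slightly from geopy's geodesic distance; the function name `haversine_m` is my own.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius; spherical approximation

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# One degree of latitude, and one degree of longitude near Toronto's latitude.
m_per_deg_lat = haversine_m(43.7, -79.4, 44.7, -79.4)
m_per_deg_lon = haversine_m(43.7, -79.4, 43.7, -78.4)
print(round(m_per_deg_lat), round(m_per_deg_lon))
```

By this estimate a degree of latitude is about 111 km and a degree of longitude near Toronto about 80 km, so the mean differences of 0.00253 and 0.00352 degrees each come out near 280 m.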

In [180]:
from geopy import distance
for i, row in df_d.iterrows():
    df_d.loc[i,'GT Diff Meters'] = distance.distance((row['Latitude'],row['Longitude']),(row['GT Latitude'],row['GT Longitude'])).meters
print(f"Differences: {df_d['GT Diff Meters'].mean():0.1f} +/- {df_d['GT Diff Meters'].std():0.1f} meters with maximum {df_d['GT Diff Meters'].max():0.1f} meters")
df_d.head()

Differences: 438.4 +/- 348.7 meters with maximum 2189.5 meters


Unnamed: 0,Postal Code,Borough,Neighbourhood,GString,Latitude,Longitude,GT Latitude,GT Longitude,Diff Lat,Diff Lon,GT Diff Meters
0,M3A,North York,Parkwoods,"M3A, Toronto, ON, Canada, Toronto, ON M3A, CAN",43.75245,-79.32991,43.753259,-79.329656,-0.000809,-0.000253,92.13208
1,M4A,North York,Victoria Village,"M4A, Toronto, ON, Canada, Toronto, ON M4A, CAN",43.73057,-79.31306,43.725882,-79.315572,0.004688,0.002512,558.767359
2,M5A,Downtown Toronto,"Regent Park, Harbourfront","M5A, Toronto, ON, Canada, Toronto, ON M5A, CAN",43.65512,-79.36264,43.65426,-79.360636,0.00086,-0.002004,187.80158
3,M6A,North York,"Lawrence Manor, Lawrence Heights","M6A, Toronto, ON, Canada, Toronto, ON M6A, CAN",43.72327,-79.45042,43.718518,-79.464763,0.004752,0.014343,1270.683919
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government","M7A, Toronto, ON, Canada, Toronto, ON M7A, CAN",43.66253,-79.39188,43.662301,-79.389494,0.000228,-0.002386,194.136996


That's a very big difference!  Though most are smaller, we'd still like some confidence in knowing which to use.

Manually inspecting Google Maps, the outline for M6A postal code has the HERE Lat/Long at the center, and the csv value near the left edge of the zone.  For M8Y, both HERE and csv are a little off, though HERE is probably closer.  Ideally with the postal code polygons we could find the center of mass, but that level of exactness won't help us.
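
For reference, the center of mass of a polygon can be computed from its vertices with the shoelace formula. The sketch below is not used in this notebook; it treats longitude and latitude as planar x and y (an approximation that is only reasonable over a small area), and the rectangular `zone` is hypothetical rather than a real postal code boundary.

```python
def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon given as [(x, y), ...]."""
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the polygon
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError('degenerate polygon with zero area')
    # Centroid = sum / (6 * area) = sum / (3 * area2)
    return cx / (3 * area2), cy / (3 * area2)

# Hypothetical rectangular zone, longitude as x and latitude as y.
zone = [(-79.47, 43.71), (-79.43, 43.71), (-79.43, 43.73), (-79.47, 43.73)]
center = polygon_centroid(zone)
print(center)
```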

Let's stick with the values from the HERE API lookup (in df_c), and drop the GString as it has served its purpose for validation.

In [183]:
df_c.drop(columns='GString',inplace=True)
df_c

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.75245,-79.32991
1,M4A,North York,Victoria Village,43.73057,-79.31306
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.72327,-79.45042
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66253,-79.39188
...,...,...,...,...,...
95,M5X,Downtown Toronto,"First Canadian Place, Underground city",43.64828,-79.38146
96,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.65319,-79.51113
97,M4Y,Downtown Toronto,Church and Wellesley,43.66659,-79.38133
98,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.63278,-79.48945


## K-Means Clustering

What is the nearest neighbor distance for each neighborhood, so that we can define a reasonable search radius?

In [186]:
dist_matrix = []
for i, row in df_c.iterrows():
    # This implementation is slow by a factor of 2, look into distance matrix functions
    dist_matrix.append([distance.distance((row['Latitude'],row['Longitude']),(x['Latitude'],x['Longitude'])).meters for j, x in df_c.iterrows()])
dist_nn = [min([x for x in y if x > 0]) for y in dist_matrix]
dist_nn[0:5]

[0.0,
 2784.281592959613,
 11131.144283352676,
 10235.15951116031,
 11169.664247576218,
 18849.394368486333,
 12570.334123951307,
 2598.038673910266,
 5234.490657138429,
 11251.711461517778,
 10786.819381322573,
 21304.635000375703,
 14268.94893027473,
 3590.276605939408,
 7215.954187458286,
 11743.622079169638,
 10510.095588036369,
 23107.233122310787,
 12586.110770593476,
 8821.026501855642,
 12396.385589136586,
 12078.817012869,
 9209.41225597602,
 5532.074576675346,
 11588.382512024702,
 11838.94193965751,
 7566.460775879684,
 5906.500161530724,
 9567.303450261858,
 5875.777121351951,
 12180.08393161513,
 13094.897818463312,
 8002.335653528734,
 3481.1774040313508,
 12802.748531817519,
 7156.886513637299,
 12848.675382144527,
 13548.681843144448,
 6035.375398881722,
 5177.4640101858095,
 11336.066098669991,
 7898.7577850093185,
 12422.67280077151,
 14788.422002462634,
 5689.978565540293,
 4113.273973058146,
 15479.860150679471,
 9466.308400094027,
 12222.75908776462,
 13497.1014921
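
The comment in the cell above suggests looking into distance matrix functions. One possible vectorized version with NumPy broadcasting is sketched below. It swaps geopy's geodesic distance for the spherical haversine formula (so values differ slightly), and the function name `nearest_neighbor_m` is hypothetical.

```python
import numpy as np

def nearest_neighbor_m(lat_deg, lon_deg, radius_m=6371000.0):
    """Nearest-neighbor great-circle distance (meters) for each point."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    # Broadcasting builds all pairwise differences at once: shape (n, n).
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    dist = 2 * radius_m * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
    # A point is not its own neighbor: mask the diagonal before taking minima.
    np.fill_diagonal(dist, np.inf)
    return dist.min(axis=1)

# First three HERE coordinates from the postal code table (M3A, M4A, M5A).
lats = [43.75245, 43.73057, 43.65512]
lons = [-79.32991, -79.31306, -79.36264]
nn = nearest_neighbor_m(lats, lons)
print(nn)
```

Masking the diagonal with infinity also avoids the `x > 0` filter, which would otherwise skip two distinct postal codes that happened to geocode to the same point.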

In [195]:
import matplotlib.pyplot as plt
%matplotlib inline 

ModuleNotFoundError: No module named 'matplotlib'

Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.

Just make sure:

* to add enough Markdown cells to explain what you decided to do and to report any observations you make.
* to generate maps to visualize your neighborhoods and how they cluster together.

Once you are happy with your analysis, submit a link to the new Notebook on your Github repository. (3 marks)

## APPENDIX

### Foray into BeautifulSoup for parsing webpages: obsolete because pandas.read_html() works fine.

Next we will parse the webpage using [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/).

from bs4 import BeautifulSoup # had to install to environment in Anaconda

soup = BeautifulSoup(webpage.text, 'html.parser')

To view the HTML directly we could run:

    print(soup.prettify())
    
But we can also inspect the webpage for the table we expect:

tables = soup.find_all('table')
len(tables)

for i, table in enumerate(tables):
    print('Length of table {} string: {}, attributes: {}'.format(i,len(table.text),table.attrs))

import lxml
import html5lib

str(tables[0])

df_pcodes = pd.read_html(str(tables[0]))[0]
#df_pcodes = pd.DataFrame(df_pcodes[1:][:],df_pcodes[1:][1],df_pcodes[0][:])
type(df_pcodes)
df_pcodes

### Installation of dependencies: with relation to developing in Anaconda

    conda install -c conda-forge geocoder

Gave problem with openssl-1.1.1h-he774522_0.tar.bz2

From Anaconda Navigator, I removed openssl, restarted, then installed it again.

Doing so seemed to reset the environment; I had to reinstall jupyterlab, pandas, numpy, lxml, html5lib, bs4.

I also discovered the Channels setting, and by adding conda-forge I was able to access packages that I installed through the prompt before:
geocoder, jupyterlab-git, ipywidgets.

So there may be additional fallout from this, but at least I can use Anaconda as intended now that I can get the packages I need through the UI.

### Geocoding

Unfortunately, geocoder.google('Ottawa, ON') and the other geocoder services failed to return data (over 200 explicit calls, and over 20 minutes in a loop in the case of Google). The exception was CanadaPost, which did work but only returns a postal code. Instead I elect to use Nominatim; we'll see if that works for Toronto...

The following cells are the original geocoding work:

I would like to try to get latitude and longitude for each neighborhood, so let's make a new dataframe:

    df_n = pd.DataFrame(columns=df_cleaned.columns) 
    for i, row in df_cleaned.iterrows():
        for neighborhood in row['Neighbourhood'].split(', '):
            df_n = df_n.append(row, ignore_index=True)
            df_n.loc[df_n.shape[0]-1,'Neighbourhood'] = neighborhood
    df_n.head(10)

    When a neighborhood appears in several postal codes, part of the neighborhood appears in each one.  In absence of geographic boundary polygons, let's weight the location of each neighborhood by the inverse of its number of appearances overall when we average the latitude and longitude of neighborhoods to get the postal code locations.  So let's count the number of times the neighborhood appears in the list.

    for i, row in df_n.iterrows():
        df_n.loc[i, 'Count'] = int(df_n[df_n['Neighbourhood']==row['Neighbourhood']].shape[0])
    df_n['Count'] = pd.to_numeric(df_n['Count'], downcast='integer')

    display(df_n.head())
    df_n[['Count']].value_counts(sort=False)

    Good, not many neighborhoods appear in more than one postal code, so this should not have much effect on the results.

    Next we will geocode the neighborhoods.

    Attempting to geocode a latitude and longitude in a non-commercial environment requires some effort.

    Going through some of the list at the geocoder [documentation](https://geocoder.readthedocs.io/):
    * geopy.geocoders.google never returned a location in over 200 calls
    * geopy.geocoders.canadapost does not return latitude and longitude
    * geopy.geocoders.geolytica throws AttributeError: 'dict' object has no attribute 'strip' (see alternate geocoder.ca approach below)
    * geopy.geocoders.bing requires an API key
    * geopy.geocoders.tomtom requires an API key
    * geopy.geocoders.mapquest requires an API key
    * geopy.geocoders.mapbox requires an API key
    * geopy.geocoders.yahoo returns a KeyError: 'statusDescription'
    * geopy.geocoders.ottawa does not return many responses
    * geopy.geocoders.Nominatim (osm) gives a reasonable number of responses, will go with this

    Another option is to go through website requests directly, e.g.

        https://geocoder.ca/T0H 2N2?json=1

    This returns a JSON with latitude and longitude for a postal code. However, this requires a full code and is limited to 500-2000 requests per day. An exhaustive search of the final three characters of the postal code (digit, letter, digit, so up to 10 x 26 x 10 = 2600 combinations) and averaging the returned locations would require ~2600 requests per entry, which is prohibitive.

    from geopy.geocoders import Nominatim
    from geopy.extra.rate_limiter import RateLimiter

    import config

    geolocator = Nominatim(user_agent=config.NM_AGENT)
    geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1.5) # Hard minimum is 1.0 seconds

    df_n['GString'] = None
    df_n['Lat'] = None
    df_n['Lon'] = None
    df_n.head()

    for i, row in df_n.iterrows():
        g = geocode(f"{row['Neighbourhood']}, Toronto, Ontario")
        if g!=None:
            df_n.loc[i, 'GString'] = g[0]
            df_n.loc[i, 'Lat'] = g[1][0]
            df_n.loc[i, 'Lon'] = g[1][1]
    df_n.head()

    Let's find the indices where the lookup failed:

    ind_fail = df_n.index[~df_n['GString'].notnull()]
    print(f'Lookup failed for {len(ind_fail)} neighbourhoods')

    Construct a dataframe of these missing values for refilling:

    df_nfail = df_n.loc[ind_fail,:]
    showalldf(df_nfail)

    Let's try filling in these blanks by adjusting the search string:

    for i, row in df_nfail.iterrows():
        g = geocode(f"{row['Neighbourhood']}, {row['Borough']}, Ontario")
        if g!=None:
            df_nfail.loc[i, 'GString'] = g[0]
            df_nfail.loc[i, 'Lat'] = g[1][0]
            df_nfail.loc[i, 'Lon'] = g[1][1]
    showalldf(df_nfail)

    Not so easy, let's try adding the postal code:

    geocode(f"Capitol Building, Toronto, Ontario")

    geocode('Caledonia, Toronto, Ontario')

    geocode('Fairbank, Toronto, Ontario')

    geocode('Caledonia-Fairbank, Toronto, Ontario')

    geocode('Del Rey, Toronto, Ontario')

    geocode('Keelesdale, Toronto, Ontario')

    geocode('Silverthorn, Toronto, Ontario')

    geocode('Keelesdale and Silverthorn, Toronto, Ontario')

    geocode('union station, Toronto, Ontario')

    geocode('local airport, Toronto, Ontario')

    geocode('humber bay, Toronto, Ontario')

    geocode('beaumonde heights, Toronto, Ontario')

    So some locations can be found; names are spelled slightly differently, and some are not correct.

    Going through Google Maps there is functionality for obtaining boundaries of postal codes, and I used this to check the above, but I don't particularly want to pay for that.



    Let's also find where the returned location string does not match the neighborhood name exactly, and make a new dataframe for those entries as well.

    showalldf(df_n)

    ind_mismatch = []
    for i, row in df_n.iterrows():
        if row['GString']!=None and row['GString'].split(',')[0]!=row['Neighbourhood']:
            ind_mismatch.append(i)
    ind_mismatch = df_n.index[ind_mismatch]
    print(f'Non-exact lookups for {len(ind_mismatch)} neighbourhoods')

    df_nmismatch = df_n.loc[ind_mismatch,:]
    df_nmismatch.head()



    for i, row in df_n.iterrows():
        if row['GString']!=None:
            df_n.loc[i,'Found?'] = row['GString'].split(',')[0]==row['Neighbourhood']
    df_n.head(100)

    df_n.loc[df_n['Found?']==False,'Found?'].count()

    for _, row in df_n.iterrows():
        print(f"{row['Neighbourhood']}, Toronto, Ontario")

    ll = []
    for b in df['Borough'].unique():
        ll.append(geocode(f'{b}, ON, Canada'))
    ll

    df['Borough'].unique()

    llpc = [geocode(f'{pc}, ON, Canada') for pc in df['Postal Code']]
    llpc

    This didn't turn out so well.  Let's try with neighborhood names.

    df['Neighbourhood'][2].split(',')

    df.rows()

    llnt = [geocode(f"{n}, {r['Borough']}, Ontario, Canada") for _, r in df.iterrows() for n in r['Neighbourhood'].split(',')]

    llnt

    llnt2 = [None if r['Borough']!='Downtown Toronto' else geocode(f"{n}, {r['Borough'].split()[-1]}, Ontario, Canada") for _, r in df.iterrows() for n in r['Neighbourhood'].split(',')]
    llnt2

    For below, 'Downtown Toronto' fails, maybe just put Toronto, or omit the Borough

    # Combine llnt and llnt2
    llnt3 = [tt if t==None else t for t, tt in zip(llnt, llnt2)]
    llnt3

    # Get the indices of the None entries for further inspection
    llnt3list = [i for i, x in enumerate(llnt3) if x==None]
    print(f'Number of missing entries: {len(llnt3list)}')
    print(f'At indices: {llnt3list}')

    Let's try again, as other modifiers of Toronto in the Borough name may be confusing

    llnt4 = [None if r['Borough'].split()[-1]!='Toronto' else geocode(f"{n}, {r['Borough'].split()[-1]}, Ontario, Canada") for _, r in df.iterrows() for n in r['Neighbourhood'].split(',')]
    llnt4

    # Combine results
    llnt5 = [tt if t==None else t for t, tt in zip(llnt3, llnt4)]
    llnt5

    # Get the indices of the None entries for further inspection
    llnt5list = [i for i, x in enumerate(llnt5) if x==None]
    print(f'Number of missing entries: {len(llnt5list)}')
    print(f'At indices: {llnt5list}')

    So two more entries were found, better than nothing.

    Let's work at combining entries to make a lat/long for each postal code



    lllist = [x[1] for x in llnt5 if x!=None]
    a, b = zip(*lllist)
    print(min(a), max(a), np.std(a), min(b), max(b), np.std(b))

    [(f"{n}, {r['Borough']}, Ontario, Canada") for _, r in df.iterrows() for n in r['Neighbourhood'].split(',')]

    # str.split() has no trim argument; strip each name after splitting instead
    lln = [geocode(f'{n.strip()}, ON, Canada') for s in df['Neighbourhood'] for n in s.split(',')]
    lln

    df['location'] = df['name'].apply(geocode)
    df.head()

    address = 'The Kingsway, Toronto, ON'
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    location

    address = 'Old Mill North, Toronto, ON'
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    location

    address = 'Montgomery Road, Toronto, ON'
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    location

    import geocoder
    # initialize your variable to None
    lat_lng_coords = None

    postal_code = 'M3A'

    # loop until you get the coordinates
    while(lat_lng_coords is None):
      g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
      lat_lng_coords = g.latlng

