# <center>Applied Data Science Capstone</center>

#### <center>In completion of requirements for the IBM Data Science Professional Certificate on Coursera</center>

<hr>

This file will be used to implement a capstone data science project using location data from Foursquare.

Watch it grow on [GitHub](https://github.com/Arkadiatri/Coursera_Capstone/)!

In [2]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup # had to install to environment in Anaconda
import lxml # had to install to environment in Anaconda, backdated to 4.6.1 (4.6.2 current) for pandas read_html()
import html5lib # had to install to environment in Anaconda (1.1 current) for pandas read_html()

In [3]:
print('Hello Capstone Project Course!')

Hello Capstone Project Course!


## Table of Toronto postal codes

First we will use the Requests module to get the [webpage](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) containing the data we need.

In [4]:
import requests

In [5]:
url_wikipedia_postal_codes = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [6]:
webpage = requests.get(url_wikipedia_postal_codes)

We could inspect `webpage.text` and extract the tables with BeautifulSoup, but `pandas.read_html()` already parses every table on the page into a DataFrame.

Let's inspect the tables automatically parsed from the page:

In [7]:
df = pd.read_html(webpage.text)
for i, d in enumerate(df):
    print(f'Table {i}:')
    display(d.head())

Table 0:


|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


Table 1:


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,,Canadian postal codes,,,,,,,,,,,,,,,,
1,NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...,NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...,NL NS PE NB QC ON MB SK AB BC NU/NT YT A B C E...,,,,,,,,,,,,,,,
2,NL,NS,PE,NB,QC,QC,QC,ON,ON,ON,ON,ON,MB,SK,AB,BC,NU/NT,YT
3,A,B,C,E,G,H,J,K,L,M,N,P,R,S,T,V,X,Y


Table 2:


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,NL,NS,PE,NB,QC,QC,QC,ON,ON,ON,ON,ON,MB,SK,AB,BC,NU/NT,YT
1,A,B,C,E,G,H,J,K,L,M,N,P,R,S,T,V,X,Y


By inspection, the table we want is at index 0.
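Picking the table by position is brittle if the page layout ever changes. A minimal sketch of selecting it by its column names instead is shown below. It uses small stand-in DataFrames rather than the live page, and the helper name `find_table` is hypothetical:

```python
import pandas as pd


def find_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required ones."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(set(map(str, table.columns))):
            return table
    raise ValueError(f'No table with columns {sorted(required)}')


# Stand-in tables mimicking the shape of the list returned by pd.read_html()
tables = [
    pd.DataFrame({0: ['NL', 'A'], 1: ['NS', 'B']}),
    pd.DataFrame({
        'Postal Code': ['M3A'],
        'Borough': ['North York'],
        'Neighbourhood': ['Parkwoods'],
    }),
]

postal_table = find_table(tables, ['Postal Code', 'Borough', 'Neighbourhood'])
```

With the real page, `tables` would be the list produced by `pd.read_html(webpage.text)`.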

In [8]:
df = df[0]
df.shape

(180, 3)

Let's clean up the columns.

First, the Postal Code is expected to be unique:

In [9]:
len(df['Postal Code'].unique()) == len(df['Postal Code'])

True

So each row is uniquely identified by its Postal Code, as desired. We also do not have to combine neighbourhoods into a comma-separated list as the assignment instructions ask, because the table already lists them that way.
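Had the table listed one neighbourhood per row, the combining step could be sketched roughly as follows. The sample rows are stand-ins, split apart from the table above purely to show the operation:

```python
import pandas as pd

# Illustrative input: one row per neighbourhood, so postal codes repeat
raw = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# Collapse to one row per postal code, joining neighbourhood names with commas
combined = (
    raw.groupby(['Postal Code', 'Borough'], sort=False)['Neighbourhood']
       .agg(', '.join)
       .reset_index()
)
```

After the grouping, `combined['Postal Code'].is_unique` gives the same uniqueness check as the length comparison used above.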

Second, we keep only the rows whose Borough is not 'Not assigned':

In [10]:
df = df[df['Borough']!='Not assigned'].reset_index(drop=True)
display(df.head())
df.shape

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


(103, 3)

So 77 Postal Codes were not assigned to a Borough.

We should also check that all Borough names are legitimate, which has to be done by inspection.

In [11]:
df['Borough'].unique()

array(['North York', 'Downtown Toronto', 'Etobicoke', 'Scarborough',
       'East York', 'York', 'East Toronto', 'West Toronto',
       'Central Toronto', 'Mississauga'], dtype=object)

Those all look like Borough names, and there are 10 total.

Third, we want each Neighbourhood entry that is 'Not assigned' to take the Borough name.

In [12]:
# Count matching rows; .size would count cells (rows x columns) instead
print(f"There are {(df['Neighbourhood'] == 'Not assigned').sum()} Neighborhoods not assigned")
pd.set_option('display.max_rows', 1000)
print('Neighborhoods are:')
display(df['Neighbourhood'])
pd.set_option('display.max_rows', 10)

There are 0 Neighborhoods not assigned
Neighborhoods are:


0                                              Parkwoods
1                                       Victoria Village
2                              Regent Park, Harbourfront
3                       Lawrence Manor, Lawrence Heights
4            Queen's Park, Ontario Provincial Government
5                Islington Avenue, Humber Valley Village
6                                         Malvern, Rouge
7                                              Don Mills
8                        Parkview Hill, Woodbine Gardens
9                               Garden District, Ryerson
10                                             Glencairn
11     West Deane Park, Princess Gardens, Martin Grov...
12                Rouge Hill, Port Union, Highland Creek
13                                             Don Mills
14                                      Woodbine Heights
15                                        St. James Town
16                                    Humewood-Cedarvale
17     Eringate, Bloordale Gard

So all Neighbourhood entries look OK.
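No replacement was needed for this data. Had any entry been 'Not assigned', the substitution could be sketched as below, shown on a small made-up frame (the rows are hypothetical and not taken from the table):

```python
import pandas as pd

# Hypothetical rows: the first has a Borough but no Neighbourhood
sample = pd.DataFrame({
    'Postal Code': ['M0X', 'M3A'],
    'Borough': ['Example Borough', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods'],
})

# Boolean mask of rows to fix, then copy the Borough name across
unassigned = sample['Neighbourhood'] == 'Not assigned'
sample.loc[unassigned, 'Neighbourhood'] = sample.loc[unassigned, 'Borough']
```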

We display the final dataframe here:

In [13]:
pd.set_option('display.max_rows', 1000)
display(df)
pd.set_option('display.max_rows', 10)

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |


And finally, the postal codes dataframe shape is:

In [14]:
df.shape

(103, 3)

## Latitude and Longitude for each Toronto Postal Code

In [53]:
from geopy.geocoders import Nominatim

In [None]:
NM_AGENT = 'coursera-dn7681'
geolocator = Nominatim(user_agent=NM_AGENT)

In [None]:
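from geopy.extra.rate_limiter import RateLimiter
# Nominatim's usage policy asks for at most one request per second
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
# df has no 'name' column; as an assumption, geocode an address built from each Neighbourhood
df['location'] = (df['Neighbourhood'] + ', Toronto, ON').apply(geocode)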

In [55]:
address = 'The Kingsway, Toronto, ON'
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
location

Location(The Kingsway, Etobicoke—Lakeshore, Etobicoke, Toronto, Golden Horseshoe, Ontario, M8X 1C3, Canada, (43.6473811, -79.5113328, 0.0))

In [57]:
address = 'Old Mill North, Toronto, ON'
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
location

Location(Old Mill, Bloor Street West, Old Mill, Etobicoke—Lakeshore, Etobicoke, Toronto, Golden Horseshoe, Ontario, M8X 0A5, Canada, (43.649826, -79.4943338, 0.0))

In [None]:
address = 'Montgomery Road, Toronto, ON'
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
location

In [54]:
location

Location(Wellesley, Wellesley Bus Terminal, Church-Wellesley Village, Toronto Centre, Old Toronto, Toronto, Golden Horseshoe, Ontario, M4Y 1Z2, Canada, (43.6655242, -79.3838011, 0.0))

Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.

In an older version of this course, we were leveraging the Google Maps Geocoding API to get the latitude and the longitude coordinates of each neighborhood. However, recently Google started charging for their API: http://geoawesomeness.com/developers-up-in-arms-over-google-maps-api-insane-price-hike/, so we will use the Geocoder Python package instead: https://geocoder.readthedocs.io/index.html.

The problem with this package is that you sometimes have to be persistent to get the geographical coordinates of a given postal code. A call for a given postal code may return None, and the same call made again may then return the coordinates. So, to make sure that you get the coordinates for all of our neighborhoods, you can run a while loop for each postal code. Taking postal code M5G as an example, the loop would repeat the call until coordinates are returned.
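A minimal sketch of such a retry loop follows. It is not the assignment's original snippet: it bounds the number of attempts instead of looping forever, and it uses a stand-in `flaky_lookup` function in place of a real geocoder call. The names `geocode_with_retry` and `flaky_lookup` are hypothetical, and the returned coordinates are placeholders:

```python
import time


def geocode_with_retry(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns a non-None result or attempts run out."""
    for _ in range(max_attempts):
        result = lookup(query)
        if result is not None:
            return result
        time.sleep(delay_seconds)  # pause before trying again
    return None  # caller decides how to handle a persistent failure


# Stand-in for an unreliable geocoder: returns None twice, then succeeds
calls = {'count': 0}


def flaky_lookup(query):
    calls['count'] += 1
    if calls['count'] < 3:
        return None
    return (0.0, 0.0)  # placeholder coordinates, not real data


coords = geocode_with_retry(flaky_lookup, 'M5G, Toronto, Ontario')
```

With a real service, `lookup` would wrap the geocoder call and return its latitude and longitude pair, or None on failure.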

PostalCode Borough Neighborhood Latitude Longitude

Given that this package can be very unreliable, in case you are not able to get the geographical coordinates of the neighborhoods using the Geocoder package, here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data
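The CSV route could be sketched as a join on the postal code. The column names `Postal Code`, `Latitude`, and `Longitude` are an assumption about the file's layout, and the coordinates below are placeholders rather than real values:

```python
import io

import pandas as pd

# Stand-in for the downloaded CSV (assumed column names, placeholder values)
csv_text = """Postal Code,Latitude,Longitude
M3A,1.0,-1.0
M4A,2.0,-2.0
"""
coords = pd.read_csv(io.StringIO(csv_text))

# Stand-in for the postal code dataframe built earlier
codes = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village'],
})

# A left join keeps every postal code even if the CSV lacks a match;
# validate= raises if either side has duplicate postal codes
merged = codes.merge(coords, on='Postal Code', how='left', validate='one_to_one')
```

Checking `merged['Latitude'].isna().sum()` afterwards would show how many postal codes found no coordinates.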

Use the Geocoder package or the csv file to create the following dataframe:

Important Note: There is a limit on how many times you can call the geocoder.google function. It is 2500 times per day. This should be way more than enough for you to get acquainted with the package and to use it to get the geographical coordinates of the neighborhoods in Toronto.

Once you are able to create the above dataframe, submit a link to the new Notebook on your Github repository. (2 marks)

Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.
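Restricting the data to boroughs whose name contains the word Toronto could be sketched with a string filter. The frame below is a small stand-in using rows from the table above:

```python
import pandas as pd

codes = pd.DataFrame({
    'Postal Code': ['M3A', 'M5A', 'M9A'],
    'Borough': ['North York', 'Downtown Toronto', 'Etobicoke'],
    'Neighbourhood': [
        'Parkwoods',
        'Regent Park, Harbourfront',
        'Islington Avenue, Humber Valley Village',
    ],
})

# Keep only rows whose Borough name contains the literal word 'Toronto'
is_toronto = codes['Borough'].str.contains('Toronto', regex=False)
toronto_only = codes[is_toronto].reset_index(drop=True)
```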

Just make sure:

- to add enough Markdown cells to explain what you decided to do and to report any observations you make.
- to generate maps to visualize your neighborhoods and how they cluster together.

Once you are happy with your analysis, submit a link to the new Notebook on your Github repository. (3 marks)

## APPENDIX

### Foray into BeautifulSoup for parsing webpages: obsolete because pandas.read_html() works fine.

Next we will parse the webpage using [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/).

from bs4 import BeautifulSoup # had to install to environment in Anaconda

soup = BeautifulSoup(webpage.text, 'html.parser')

To view the HTML directly we could run:

    print(soup.prettify())
    
But we can also inspect the webpage for the table we expect:

tables = soup.find_all('table')
len(tables)

for i, table in enumerate(tables):
    print('Length of table {} string: {}, attributes: {}'.format(i,len(table.text),table.attrs))

import lxml
import html5lib

str(tables[0])

df_pcodes = pd.read_html(str(tables[0]))[0]
#df_pcodes = pd.DataFrame(df_pcodes[1:][:],df_pcodes[1:][1],df_pcodes[0][:])
type(df_pcodes)
df_pcodes

### Installation of dependencies: with relation to developing in Anaconda

    conda install -c conda-forge geocoder

This gave a problem with openssl-1.1.1h-he774522_0.tar.bz2.

From Anaconda Navigator, I removed openssl, restarted, then installed it again.

Doing so seemed to reset the environment; I had to reinstall jupyterlab, pandas, numpy, lxml, html5lib, bs4.

I also discovered the Channels setting, and by adding conda-forge I was able to access packages that I had previously installed through the prompt: geocoder, jupyterlab-git, ipywidgets.

So there may be additional fallout from this, but at least I can use Anaconda as intended now that I can get the packages I need through the UI.

### Geocoding

Unfortunately, geocoder.google('Ottawa, ON') and the other services from geocoder failed to return data (over 200 calls in the case of google). The one exception was CanadaPost, which did work but only returns a postal code. Instead I elect to use Nominatim; we'll see if that works for Toronto...