# Applied Data Science Capstone Project

<h2>Webscraping information about Toronto and its Neighborhoods </h2>

Below, we scrape the Toronto postal codes Wikipedia page with the BeautifulSoup package and compile the results into a list of dictionaries. Each dictionary holds a postal code (aka zipcode) in the Toronto area, the borough for that postal code, and the neighborhoods within that borough. Then, we convert the list of dictionaries into a dataframe using pandas, clean up the dataframe a little, and confirm the rows and columns we have in our final dataframe.

First, let's import the packages that we will need to complete this process.

In [2]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from geopy.geocoders import Nominatim
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import folium

Now, let's request the Toronto postal codes Wikipedia page and store its HTML as text.

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_data = requests.get(url).text

Here, we will begin using BeautifulSoup to parse the html text we scraped from the web.

In [4]:
soup = BeautifulSoup(html_data, 'html.parser')

In [5]:
#separate the first table from the html data
table = soup.find('table')

#start with an empty list, walk through the table cells, create one dictionary per cell and save it to the list
pc_table = []

for td in table.find_all('td'):
    #each cell in the table holds a zipcode, a borough, and the associated neighborhoods
    cell = {}
    if td.span.text == 'Not assigned':
        continue
    cell['Postal Code'] = td.p.text[:3]
    cell['Borough'] = td.span.text.split('(')[0]
    cell['Neighborhood'] = td.span.text.split('(')[1].replace(')', ' ').replace(' /', ',').strip(' ')
    pc_table.append(cell)
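
The parsing step above indexes into the result of `split('(')`, which would raise an `IndexError` for any cell without parentheses. As a minimal sketch of a more defensive version, the hypothetical helper `parse_cell` below works on plain strings and assumes the span text looks like `Borough(Neighborhood / Neighborhood)`. Falling back to the borough name when no neighborhoods are listed is an illustrative choice, not something the page requires.

```python
def parse_cell(code_text, span_text):
    """Split one table cell into postal code, borough, and neighborhood.

    Hypothetical helper: returns None for unassigned cells and falls back to
    the borough name when the cell lists no neighborhoods in parentheses.
    """
    span_text = span_text.strip()
    if span_text == 'Not assigned':
        return None
    # partition() never raises, even when there is no '(' in the text
    borough, _, rest = span_text.partition('(')
    neighborhood = rest.replace(')', ' ').replace(' /', ',').strip()
    return {
        'Postal Code': code_text.strip()[:3],
        'Borough': borough.strip(),
        'Neighborhood': neighborhood or borough.strip(),
    }


# Illustrative inputs, not taken from the live page
print(parse_cell('M5A\n', 'Downtown Toronto(Regent Park / Harbourfront)'))
print(parse_cell('M1A\n', 'Not assigned'))
```

Inside the loop, this could be called as `parse_cell(td.p.text, td.span.text)`, appending the result only when it is not `None`.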

Now that we have completed the acquisition and organization of our data, let's put it into a dataframe for ease-of-use.

In [7]:
df = pd.DataFrame(pc_table)
df

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East TorontoBusiness reply mail Processing Cen...,Enclave of M4L
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [8]:
df.shape #How many rows and columns do we have?

(103, 3)

Oops! Looks like we have at least one issue with text running together in this dataframe. Let's fix that!

In [9]:
df['Borough'] = df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade': 'Downtown Toronto Stn A',
                                       'East TorontoBusiness reply mail Processing Centre969 Eastern': 'East Toronto Business',
                                       'East YorkEast Toronto': 'East York/East Toronto',
                                       'MississaugaCanada Post Gateway Processing Centre': 'Mississauga'})
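
The exact-match replacements above depend on knowing the offending strings in advance. A minimal sketch of one way to find them is shown here, assuming that run-together text shows up as a lowercase letter directly followed by an uppercase letter. The helper name `find_run_together` and the sample list are illustrative, and the pattern would also flag legitimate names such as "McDonald", so the results still need a manual look.

```python
import re

# Assumption: run-together text appears as a lowercase letter
# immediately followed by an uppercase letter (e.g. "YorkEast")
RUN_TOGETHER = re.compile(r'[a-z][A-Z]')


def find_run_together(values):
    """Return the distinct values that look like two strings joined without a space."""
    return sorted({v for v in values if RUN_TOGETHER.search(v)})


# Illustrative sample of borough strings
boroughs = ['North York', 'East YorkEast Toronto', 'Etobicoke',
            'MississaugaCanada Post Gateway Processing Centre']
print(find_run_together(boroughs))
```

With the dataframe above, this could be called as `find_run_together(df['Borough'])` before deciding which replacements to write.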

In [10]:
display(df)

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto Business,Enclave of M4L
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


Now let's confirm the rows and columns once more.

In [12]:
df.shape

(103, 3)

# Using Geocoder to get the Coordinates for Toronto Postal Codes

The process of getting the coordinates for the Toronto Postal Codes can be tricky, but it's a great exercise in problem-solving. The first question is one of validation: how do we know that the coordinates are correct? To answer it, I would like to start by finding the coordinates for the City of Toronto itself. This should give us a rough reference point for the coordinates we should be seeing as output from the geolocator.

In [13]:
place = 'Toronto, Ontario, Canada'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(place)
latitude_toronto = location.latitude
longitude_toronto = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude_toronto, longitude_toronto))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.
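
One way to turn this reference point into an actual check is to measure how far a candidate coordinate lies from the city centre. The sketch below uses the haversine formula. The function names and the 50 km threshold are assumptions chosen for illustration, not values from the course material.

```python
from math import asin, cos, radians, sin, sqrt

TORONTO = (43.6534817, -79.3839347)  # centre coordinates printed above


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km is the mean Earth radius


def is_near_toronto(lat, lon, max_km=50):
    """Return True if the point is within max_km of the centre (arbitrary threshold)."""
    return haversine_km(TORONTO[0], TORONTO[1], lat, lon) <= max_km


# Illustrative points: one in the Toronto area and one far away
print(is_near_toronto(43.806686, -79.194353))
print(is_near_toronto(0.0, 0.0))
```

Any geocoder result that fails this check is worth inspecting by hand before it goes into the dataframe.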


The next step is to look up the coordinates of our different postal codes. However, if we assume that the geolocator can find the coordinates for all our data, we may find ourselves repeatedly troubleshooting the lookups it can't complete. Therefore, instead of waiting for the errors to come, let's anticipate that we may not be able to find at least one set of coordinates for whatever reason (misspellings, misreadings, etc.) and set up a <code>try</code>/<code>except</code> block for the possibility of such an error.

In [14]:
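#Getting coordinates for each zipcode using geocoder
coordinates = {}
zipcodes = list(df['Postal Code'])

#Create the geolocator once instead of once per zipcode
geolocator = Nominatim(user_agent='toronto_explorer')

for zipcode in zipcodes:
    try:
        #geocode() returns None when nothing is found, so the attribute access below raises and lands in except
        location = geolocator.geocode('{}, Toronto, Ontario'.format(zipcode))
        latitude, longitude = location.latitude, location.longitude
    except Exception:
        #Record missing values as real NaNs rather than the string 'NaN'
        latitude, longitude = np.nan, np.nan

    coordinates[zipcode] = [latitude, longitude]

#One row per zipcode, with Latitude and Longitude columns
coord_df = pd.DataFrame.from_dict(coordinates, orient='index', columns=['Latitude', 'Longitude'])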

As we can see, the geocoder is having some issues with reliability today. So, let's use the geospatial coordinate csv file provided.
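
Because the lookup loop above talks to the geocoding service directly, its failures are hard to reproduce. A minimal sketch of one alternative is to pass the geocoding function in as an argument, so the loop can be exercised with a stand-in function offline. The names `lookup_coordinates`, `pause_seconds`, and the stub below are all hypothetical. The pause reflects the assumption that a public service such as Nominatim expects clients to space out their requests.

```python
import time


def lookup_coordinates(postal_codes, geocode, pause_seconds=1.0):
    """Look up each postal code with `geocode`, a callable taking a query string.

    `geocode` is assumed to behave like geopy's Nominatim.geocode: it returns an
    object with .latitude and .longitude, returns None, or raises an exception.
    Returns (rows, failed), where failed lists the codes that could not be found.
    """
    rows, failed = [], []
    for i, code in enumerate(postal_codes):
        if i:
            time.sleep(pause_seconds)  # space out requests to the service
        try:
            location = geocode('{}, Toronto, Ontario'.format(code))
        except Exception:
            location = None
        if location is None:
            failed.append(code)
            lat, lon = float('nan'), float('nan')
        else:
            lat, lon = location.latitude, location.longitude
        rows.append({'Postal Code': code, 'Latitude': lat, 'Longitude': lon})
    return rows, failed


class FakeLocation:
    """Stand-in for a geocoder result, used only for this offline demonstration."""
    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude


def fake_geocode(query):
    # Only knows one illustrative postal code; everything else is "not found"
    return FakeLocation(43.65426, -79.360636) if query.startswith('M5A') else None


rows, failed = lookup_coordinates(['M5A', 'M9Z'], fake_geocode, pause_seconds=0)
print(rows)
print(failed)
```

With geopy, the same function could be called as `lookup_coordinates(df['Postal Code'], geolocator.geocode)`, and `pd.DataFrame(rows)` would turn the result into a dataframe. The `failed` list shows exactly which postal codes need another approach.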

<h3> Note: </h3> In a separate file, I will complete the above task without the csv file provided. In real life, we will run into issues with our work and there will not be a csv file to fall back on when we cannot capture or acquire the data we want, so I will develop a workaround to this issue.

In [15]:
#Read the csv file into a dataframe with pandas (read_csv already returns a dataframe)
geo_coord = pd.read_csv('/Users/charaebradshaw/IBM Data Science Professional Certificate/Geospatial_Coordinates.csv')

#Let's view the dataframe
display(geo_coord)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


Great! We have finally gotten the coordinates into a dataframe, and they appear to be consistent with Toronto's central coordinates as well. Let's merge the coordinates with the neighborhoods and boroughs information.

In [16]:
#Merge the dataframes together on the Postal Code column
df_geo = df.merge(geo_coord, how='right', on=['Postal Code'])
df_geo.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494
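
With `how='right'`, a postal code that appears only in the coordinates file would end up with missing Borough and Neighborhood values, and a postal code that appears only in the scraped dataframe would be dropped. A small sketch of a check that can be run before merging is shown below. The helper name `compare_keys` and the sample codes are illustrative.

```python
def compare_keys(left_keys, right_keys):
    """Return (only_in_left, only_in_right) as sorted lists of keys."""
    left, right = set(left_keys), set(right_keys)
    return sorted(left - right), sorted(right - left)


# Illustrative key lists, not the real columns
only_left, only_right = compare_keys(['M3A', 'M4A', 'M7Y'], ['M3A', 'M4A', 'M1B'])
print(only_left, only_right)
```

For the dataframes above, this could be called as `compare_keys(df['Postal Code'], geo_coord['Postal Code'])`. Two empty lists would mean the merge neither drops rows nor introduces missing values.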
