# Coursera - Data Science Capstone project


## Getting geographical coordinates of Toronto boroughs

The aim of this notebook is to obtain the geographical coordinates of Toronto neighbourhoods from their postal codes.


### Scraping the Wikipedia page

We get the list of Toronto postal codes, boroughs and neighbourhoods from Wikipedia.


In [1]:
# Importing libraries
from bs4 import BeautifulSoup
import requests
import numpy as np
import pandas as pd

Many sites take precautions against scrapers, so we send a browser-like User-Agent header along with our request to make the scraper look like a regular browser:

In [2]:
# Access-Control-* headers are response headers set by servers; sending them
# in a request has no effect, so only the User-Agent is kept here.
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }

Now let's fetch the page and parse it:


In [3]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
# Pass the headers by keyword: the second positional argument of requests.get is params
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')
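The signature of the call is `requests.get(url, params=None, **kwargs)`, so a dictionary passed as the second positional argument ends up in the query string rather than in the request headers. The sketch below shows the difference offline by preparing requests without sending them; the names `example_headers`, `as_headers` and `as_params` are made up for this illustration.

```python
import requests

url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
# Illustrative header set (assumption: only the User-Agent matters here)
example_headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"
}

# Build the requests without sending them, so nothing touches the network
as_headers = requests.Request("GET", url, headers=example_headers).prepare()
as_params = requests.Request("GET", url, params=example_headers).prepare()

# Passed by keyword, the value is a real request header
print(as_headers.headers["User-Agent"])
# Passed as params, it is appended to the URL as a query string instead
print(as_params.url)
```

When the request is actually sent, calling `req.raise_for_status()` before parsing is a cheap way to notice a blocked or failed download.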

Get the table from the HTML:


In [4]:
found_table = soup.find('table', class_='wikitable sortable')

Then get all the cells of the table.
The text of each cell is collected into one flat list.


In [5]:
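# find_all('td') returns the individual cells of the table, not whole rows
cells = found_table.find_all('td')
# Keep only the stripped text of each cell
entries = [cell.get_text().strip() for cell in cells]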
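To see what this extraction step produces, it can help to run it on a small hand-written table. The sketch below uses an invented HTML snippet (not the real Wikipedia markup, which may differ) and, as an alternative to the flat list, walks the table row by row so the cells stay grouped. The notebook itself continues with the flat list.

```python
from bs4 import BeautifulSoup

# Hand-written sample table, assumed to resemble the page layout
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A </td><td>North York </td><td>Parkwoods </td></tr>
  <tr><td>M4A </td><td>North York </td><td>Victoria Village </td></tr>
</table>
"""

table = BeautifulSoup(sample_html, "html.parser").find("table", class_="wikitable sortable")

rows = []
for tr in table.find_all("tr"):
    # Collect the stripped text of every data cell in this row
    cells = [td.get_text().strip() for td in tr.find_all("td")]
    if cells:  # the header row only has <th> cells, so it yields an empty list
        rows.append(cells)

print(rows)
```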


The following function splits the flat list into chunks of n elements. In our case each chunk has three elements: PostalCode, Borough and Neighbourhood.

In [6]:
# Split the list l into consecutive chunks of n items each
def chunks(l, n):
    # Step through l in strides of n
    for i in range(0, len(l), n):
        # Yield the slice holding the next n items
        yield l[i:i+n]

rows_data = list(chunks(entries, 3))
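A short, self-contained run shows how the chunking behaves. The sample cells are taken from rows that appear later in this notebook, and the extra `"M5A"` cell is added only to illustrate what happens when the cell count is not a multiple of three.

```python
def chunks(items, size):
    """Yield consecutive slices of `items`, each at most `size` elements long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

cells = ["M3A", "North York", "Parkwoods",
         "M4A", "North York", "Victoria Village"]

grouped = list(chunks(cells, 3))
print(grouped)

# With a stray extra cell, the last chunk is shorter than the others
ragged = list(chunks(cells + ["M5A"], 3))
print(len(ragged[-1]))
```

Checking that every chunk has length three before building the dataframe is a simple guard against a change in the page layout.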

Now we are ready to create the dataframe and populate it.


In [7]:
# DataFrame.append was removed in pandas 2.0, so build the frame
# from the list of rows in a single call instead of appending row by row
df = pd.DataFrame(rows_data, columns=['PostalCode', 'Borough', 'Neighbourhood'])

Remove all the rows where the Borough is set to Not assigned:


In [8]:
df = df[df['Borough'] != 'Not assigned']
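Boolean filtering keeps the original index labels, so the filtered dataframe's index no longer starts at 0. The sketch below shows this on a few made-up rows, with `reset_index(drop=True)` as an optional renumbering step that this notebook does not apply; the names `sample`, `kept` and `renumbered` are illustrative.

```python
import pandas as pd

# Small illustrative frame; the first row stands in for an unassigned postal code
sample = pd.DataFrame(
    [["M1A", "Not assigned", "Not assigned"],
     ["M3A", "North York", "Parkwoods"],
     ["M4A", "North York", "Victoria Village"]],
    columns=["PostalCode", "Borough", "Neighbourhood"],
)

# Keep only rows with an assigned borough; index labels are preserved
kept = sample[sample["Borough"] != "Not assigned"]
print(kept.index.tolist())

# Optionally renumber the rows from 0
renumbered = kept.reset_index(drop=True)
print(renumbered.index.tolist())
```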

## Getting geographical coordinates


Because the service behind the Geocoder package is not reliable, we use the CSV file from http://cocl.us/Geospatial_data instead.


In [9]:
df_postal_code = pd.read_csv('http://cocl.us/Geospatial_data') 

In [10]:
df_postal_code.columns

Index(['Postal Code', 'Latitude', 'Longitude'], dtype='object')

In [11]:
lat_dictionary = pd.Series(df_postal_code.Latitude.values,index=df_postal_code['Postal Code']).to_dict()
lon_dictionary = pd.Series(df_postal_code.Longitude.values,index=df_postal_code['Postal Code']).to_dict()

In [12]:
# Look up a postal code in a {postal code: coordinate} dictionary
def set_value(postal_code, lookup):
    return lookup[postal_code]

df['Latitude'] = df['PostalCode'].apply(set_value, args=(lat_dictionary,))
df['Longitude'] = df['PostalCode'].apply(set_value, args=(lon_dictionary,))
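The dictionary lookup above raises a `KeyError` if a postal code is missing from the CSV. As an alternative, the same join can be sketched with `DataFrame.merge`; this is not the approach used in this notebook. The coordinates for M3A and M4A are the ones shown in the output below, while the `"M0Z"` row is invented to show how a left join handles a missing code.

```python
import pandas as pd

neighbourhoods = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M0Z"],  # "M0Z" is a made-up code with no coordinates
    "Borough": ["North York", "North York", "Hypothetical"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# A left join keeps every neighbourhood row; unmatched codes get NaN coordinates
merged = (
    neighbourhoods
    .merge(coords, how="left", left_on="PostalCode", right_on="Postal Code")
    .drop(columns="Postal Code")
)
print(merged)
```

Rows with NaN coordinates can then be inspected or dropped explicitly instead of stopping the notebook with an exception.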

A couple of checks on the updated dataset:


In [13]:
df.head()

|   | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 2 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 3 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |


In [14]:
df[df['PostalCode'] == "M4B"]

|   | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 12 | M4B | East York | Parkview Hill, Woodbine Gardens | 43.706397 | -79.309937 |
