# Part One
## Web Scraping and Data Cleaning

I know you can read the URL directly with pandas, but let's practice using Beautiful Soup. We'll use requests to get the HTML from the URL, then use Beautiful Soup to find the table on the page. We'll then pass that table to pandas to make a dataframe.

In [1]:
# import dependencies
from bs4 import BeautifulSoup
from pprint import pprint
import requests
import pandas as pd
import numpy as np

In [2]:
# get html info using requests
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
r = requests.get(url)
doc = r.content

In [3]:
# read html into bs object
soup = BeautifulSoup(doc, 'lxml')
# print(soup.prettify())

In [4]:
# get just the table on the wikipedia page
table = soup.find('table')
print(table.prettify())

<table class="wikitable sortable">
 <tbody>
  <tr>
   <th>
    Postcode
   </th>
   <th>
    Borough
   </th>
   <th>
    Neighbourhood
   </th>
  </tr>
  <tr>
   <td>
    M1A
   </td>
   <td>
    Not assigned
   </td>
   <td>
    Not assigned
   </td>
  </tr>
  <tr>
   <td>
    M2A
   </td>
   <td>
    Not assigned
   </td>
   <td>
    Not assigned
   </td>
  </tr>
  <tr>
   <td>
    M3A
   </td>
   <td>
    <a href="/wiki/North_York" title="North York">
     North York
    </a>
   </td>
   <td>
    <a href="/wiki/Parkwoods" title="Parkwoods">
     Parkwoods
    </a>
   </td>
  </tr>
  <tr>
   <td>
    M4A
   </td>
   <td>
    <a href="/wiki/North_York" title="North York">
     North York
    </a>
   </td>
   <td>
    <a href="/wiki/Victoria_Village" title="Victoria Village">
     Victoria Village
    </a>
   </td>
  </tr>
  <tr>
   <td>
    M5A
   </td>
   <td>
    <a href="/wiki/Downtown_Toronto" title="Downtown Toronto">
     Downtown Toronto
    </a>
   </td>
   <td>
    <a href="

In [5]:
# cast the bs object for the table into a str so pandas can read it, then convert to df
# read_html returns a list of dfs; there should only be 1 table here, so index 0 should be the correct one
df = pd.read_html(str(table))[0]
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
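
For comparison, the table rows can also be pulled out cell by cell with Beautiful Soup instead of handing the whole table to `pd.read_html`. The sketch below is only an illustration: it runs on a small inline HTML snippet standing in for the Wikipedia page, uses the standard-library `html.parser` rather than `lxml`, and the name `manual_df` is made up for this example.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Tiny stand-in for the Wikipedia table (same three columns as above)
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td>
      <td><a href="/wiki/Parkwoods">Parkwoods</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')  # stdlib parser, no lxml needed
table = soup.find('table')

# Header cells are <th>, data cells are <td>
header = [th.get_text(strip=True) for th in table.find_all('th')]
rows = [
    [td.get_text(strip=True) for td in tr.find_all('td')]
    for tr in table.find_all('tr')
    if tr.find('td')  # skip the header row, which has no <td>
]

manual_df = pd.DataFrame(rows, columns=header)
print(manual_df)
```

`get_text(strip=True)` drops the whitespace and the surrounding `<a>` tags, so linked and unlinked cells end up as plain strings.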


### Data Cleaning
Criteria:
* The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
* Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
* More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
* If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.
* In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.
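
The criteria above can be sketched end to end on a toy frame before applying them to the real data. This is a minimal illustration, not the graded solution: the rows are hand-typed from the examples mentioned in the criteria, and the names `raw`, `assigned`, and `toy_clean` are made up for this example.

```python
import pandas as pd

# Toy rows in the shape of the scraped table
raw = pd.DataFrame({
    'Postcode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# 1. Keep only rows with an assigned borough
assigned = raw[raw['Borough'] != 'Not assigned'].copy()

# 2. A 'Not assigned' neighbourhood falls back to the borough name
missing = assigned['Neighbourhood'] == 'Not assigned'
assigned.loc[missing, 'Neighbourhood'] = assigned.loc[missing, 'Borough']

# 3. One row per postal code, neighbourhoods joined with ', '
toy_clean = (
    assigned.groupby(['Postcode', 'Borough'])['Neighbourhood']
    .agg(', '.join)
    .reset_index()
)

print(toy_clean)
print(toy_clean.shape)
```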

#### Data Exploration
Let's explore the data a little so we can see what we need to do. How many boroughs are 'Not assigned'? How many post codes have multiple neighbourhoods?

In [6]:
# use .info() method to see how much data we have and check for null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 288 entries, 0 to 287
Data columns (total 3 columns):
Postcode         288 non-null object
Borough          288 non-null object
Neighbourhood    288 non-null object
dtypes: object(3)
memory usage: 6.8+ KB


In [7]:
# how many not assigned values are there?
df['Borough'].value_counts()

Not assigned        77
Etobicoke           45
North York          38
Downtown Toronto    37
Scarborough         37
Central Toronto     17
West Toronto        13
York                 9
East Toronto         7
East York            6
Queen's Park         1
Mississauga          1
Name: Borough, dtype: int64
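
The other exploration question, how many post codes have more than one neighbourhood, could be answered along these lines. The sketch uses a small hand-typed frame in the shape of the scraped table rather than the full `df`, and the names `toy`, `rows_per_code`, and `multi` are made up for this example.

```python
import pandas as pd

# Toy rows: two post codes appear twice, one appears once
toy = pd.DataFrame({
    'Postcode': ['M3A', 'M5A', 'M5A', 'M6A', 'M6A'],
    'Neighbourhood': ['Parkwoods', 'Harbourfront', 'Regent Park',
                      'Lawrence Heights', 'Lawrence Manor'],
})

# Count rows per post code, then keep the codes that appear more than once
rows_per_code = toy['Postcode'].value_counts()
multi = rows_per_code[rows_per_code > 1]

print(multi.sort_index())
print(f'{len(multi)} post codes have more than one neighbourhood')
```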

Let's drop all the rows that have a 'Not assigned' borough.

In [8]:
# filter out any rows where the Borough column is 'Not assigned'
df = df[df['Borough'] != 'Not assigned'].copy() # copy so later assignments don't raise SettingWithCopyWarning
df['Borough'].value_counts() # verify value counts

Etobicoke           45
North York          38
Downtown Toronto    37
Scarborough         37
Central Toronto     17
West Toronto        13
York                 9
East Toronto         7
East York            6
Queen's Park         1
Mississauga          1
Name: Borough, dtype: int64

Next, let's find the rows where the neighbourhoods are 'Not assigned'. Once we've found those rows, we can set the 'Neighbourhood' column equal to the 'Borough' column.

In [9]:
# Find boroughs where the neighborhood is not assigned
no_neighborhood = df[df['Neighbourhood'] == 'Not assigned']
no_neighborhood

Unnamed: 0,Postcode,Borough,Neighbourhood
8,M7A,Queen's Park,Not assigned


Now that we know what rows to change, we can slice those rows and set the 'Neighbourhood' equal to the 'Borough'. There's technically only one row that we have to change for this data set, but this code would work in case there were multiple rows. It's good for practice.

In [10]:
# Find the indices of the desired rows
indices_to_change = no_neighborhood.index
# slice and change df
df.loc[indices_to_change, 'Neighbourhood'] = df.loc[indices_to_change, 'Borough']

# Verify Change
df[df['Borough'] == "Queen's Park"]

Unnamed: 0,Postcode,Borough,Neighbourhood
8,M7A,Queen's Park,Queen's Park
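
The same replacement can also be written without collecting indices first. The sketch below shows one option, `Series.mask`, on a hand-typed two-row frame; `toy` is a made-up name and this is just an alternative, not a correction of the approach above.

```python
import pandas as pd

toy = pd.DataFrame({
    'Postcode': ['M6A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighbourhood': ['Lawrence Heights', 'Not assigned'],
})

# Series.mask replaces values where the condition is True with the
# index-aligned values from the other Series
toy['Neighbourhood'] = toy['Neighbourhood'].mask(
    toy['Neighbourhood'] == 'Not assigned', toy['Borough']
)
print(toy)
```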


In [11]:
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


The last step of the cleaning is to concatenate all neighbourhoods in each postal code into one row. We can group by Postcode and then Borough, then join the neighbourhoods in each group, separated by ', '.

In [12]:
# Group the df and get the series for 'Neighbourhood'. Then join each str, separated by ', '. Convert the new series to a df afterwards
clean_df = df.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(', '.join).to_frame()

# reset index
clean_df.reset_index(inplace=True)

clean_df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
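
Because the grouping uses both Postcode and Borough, a post code listed under two different boroughs would still produce two rows. A quick sanity check, sketched here on a hand-typed frame with the made-up names `toy` and `grouped`, is to confirm that each post code appears only once after grouping.

```python
import pandas as pd

toy = pd.DataFrame({
    'Postcode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge', 'Malvern', 'Woburn'],
})

grouped = (
    toy.groupby(['Postcode', 'Borough'])['Neighbourhood']
    .agg(', '.join)   # join keeps the original row order within each group
    .reset_index()
)

# Each post code should now appear exactly once; this would be False if
# one post code were listed under two different boroughs
print(grouped['Postcode'].is_unique)
print(grouped)
```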


In [13]:
# display shape of df
clean_df.shape

(103, 3)

# Part Two
## Geocoding
We need to get coordinates (latitude and longitude) for each postal code. Geocoder wasn't working for me, and I tried other sources like geopy and Mapbox with mixed results. Mapbox was successful, but gave slightly different results from the provided csv. For grading simplicity and continuity, I'm going to use the csv.

After reading the data from the provided csv, let's build a dataframe of coordinates for each unique postal code, then merge that df with the previously made df containing the boroughs and neighbourhoods.
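
The merge step could look roughly like the sketch below. Everything about the csv here is an assumption: the column names (`Postal Code`, `Latitude`, `Longitude`) and the coordinate values are placeholders rather than the real file contents, and `toy_clean` stands in for `clean_df`.

```python
import io
import pandas as pd

# Stand-in for the provided csv; column names and values are placeholders
csv_text = """Postal Code,Latitude,Longitude
M1B,0.0,0.0
M1C,1.0,1.0
"""
coords = pd.read_csv(io.StringIO(csv_text))

# Stand-in for clean_df
toy_clean = pd.DataFrame({
    'Postcode': ['M1B', 'M1C', 'M1E'],
    'Borough': ['Scarborough'] * 3,
    'Neighbourhood': ['Rouge, Malvern',
                      'Highland Creek, Rouge Hill, Port Union',
                      'Guildwood, Morningside, West Hill'],
})

# A left merge keeps every post code from the cleaned frame; rows without a
# match in the csv get NaN coordinates, which makes gaps easy to spot
merged = toy_clean.merge(
    coords, how='left', left_on='Postcode', right_on='Postal Code'
).drop(columns='Postal Code')

print(merged)
print(merged['Latitude'].isna().sum(), 'post codes without coordinates')
```

Using `how='left'` rather than the default inner join means a post code missing from the csv shows up as a row with NaN instead of silently disappearing.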

In [35]:
# Import dependencies
import geopy

In [36]:
# get list of unique post codes
post_codes = clean_df['Postcode'].unique()
post_codes

array(['M1B', 'M1C', 'M1E', 'M1G', 'M1H', 'M1J', 'M1K', 'M1L', 'M1M',
       'M1N', 'M1P', 'M1R', 'M1S', 'M1T', 'M1V', 'M1W', 'M1X', 'M2H',
       'M2J', 'M2K', 'M2L', 'M2M', 'M2N', 'M2P', 'M2R', 'M3A', 'M3B',
       'M3C', 'M3H', 'M3J', 'M3K', 'M3L', 'M3M', 'M3N', 'M4A', 'M4B',
       'M4C', 'M4E', 'M4G', 'M4H', 'M4J', 'M4K', 'M4L', 'M4M', 'M4N',
       'M4P', 'M4R', 'M4S', 'M4T', 'M4V', 'M4W', 'M4X', 'M4Y', 'M5A',
       'M5B', 'M5C', 'M5E', 'M5G', 'M5H', 'M5J', 'M5K', 'M5L', 'M5M',
       'M5N', 'M5P', 'M5R', 'M5S', 'M5T', 'M5V', 'M5W', 'M5X', 'M6A',
       'M6B', 'M6C', 'M6E', 'M6G', 'M6H', 'M6J', 'M6K', 'M6L', 'M6M',
       'M6N', 'M6P', 'M6R', 'M6S', 'M7A', 'M7R', 'M7Y', 'M8V', 'M8W',
       'M8X', 'M8Y', 'M8Z', 'M9A', 'M9B', 'M9C', 'M9L', 'M9M', 'M9N',
       'M9P', 'M9R', 'M9V', 'M9W'], dtype=object)

In [107]:
# define function to get the coordinates from each postal code
def get_toronto_coordinates(postal_code):
    '''Takes a postal code in Toronto and returns geopy location object containing latitude and longitude of that postal code'''
    query = f'Toronto, Ontario, {postal_code}' # build search string

    locator = geopy.Nominatim(user_agent='myGeoCoder')
    location = locator.geocode(query) # perform search
    return location

In [109]:
# test for one location
postal_code = 'M1E'
location = get_toronto_coordinates(postal_code)
print(f'coordinates: {location.latitude}, {location.longitude}')

AttributeError: 'NoneType' object has no attribute 'latitude'
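
The error above is what happens when the geocoder finds no match: `geocode` returns `None`, and `None` has no `latitude` attribute. A small guard avoids the crash. The helper name `extract_coordinates` is made up for this sketch, and `SimpleNamespace` objects stand in for real geopy results so the example runs without a network call.

```python
from types import SimpleNamespace

def extract_coordinates(location):
    """Return (latitude, longitude), or None when the geocoder found nothing."""
    if location is None:
        return None
    return (location.latitude, location.longitude)

# Stand-ins for geocoder results: a miss, and a hit with illustrative values
miss = None
hit = SimpleNamespace(latitude=43.76364, longitude=-79.18226)

print(extract_coordinates(miss))
print(extract_coordinates(hit))
```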

In [110]:
# let's try mapbox
import requests

In [127]:
base_url = 'https://api.mapbox.com/geocoding/v5/' # {endpoint}?access_token={your_access_token}'

# input user for access token
access_token = input('Enter your access token here: ')

Enter your access token here: <your_access_token>


In [130]:
endpoint = 'mapbox.places'
search_text = postal_code
# search_text = 'Toronto'

# build query
query = base_url + endpoint + '/' + search_text + '.json' + '?country=CA' + '&access_token=' + access_token
print(query)

https://api.mapbox.com/geocoding/v5/mapbox.places/M1E.json?country=CA&access_token=<your_access_token>
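
String concatenation works for a search text like `M1E`, but a value containing a space or other special character would need escaping. One way to handle that, sketched with the standard library only, is shown below. The function name `build_geocoding_url` and the `'YOUR_TOKEN'` placeholder are made up for this example; the URL layout simply mirrors the query built above.

```python
from urllib.parse import quote, urlencode

BASE_URL = 'https://api.mapbox.com/geocoding/v5/'

def build_geocoding_url(search_text, access_token,
                        endpoint='mapbox.places', country='CA'):
    """Build the same URL shape as the query above, escaping each piece."""
    path = f'{endpoint}/{quote(search_text)}.json'
    params = urlencode({'country': country, 'access_token': access_token})
    return f'{BASE_URL}{path}?{params}'

# 'YOUR_TOKEN' is a placeholder, not a working token
url = build_geocoding_url('M1E', 'YOUR_TOKEN')
print(url)
```

Keeping the token out of printed output (for example by printing the URL only with a placeholder) avoids leaking it when the notebook is shared.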


In [131]:
r = requests.get(query)
r

<Response [200]>

In [133]:
json = r.json()

In [134]:
json

{'type': 'FeatureCollection',
 'query': ['m1e'],
 'features': [{'id': 'postcode.3645352727997350',
   'type': 'Feature',
   'place_type': ['postcode'],
   'relevance': 1,
   'properties': {},
   'text': 'M1E 3H8',
   'place_name': 'M1E 3H8, Toronto, Ontario, Canada',
   'center': [-79.18226, 43.76364],
   'geometry': {'type': 'Point', 'coordinates': [-79.18226, 43.76364]},
   'context': [{'id': 'place.1521272661948910',
     'wikidata': 'Q172',
     'text': 'Toronto'},
    {'id': 'region.7377835739263190',
     'short_code': 'CA-ON',
     'wikidata': 'Q1904',
     'text': 'Ontario'},
    {'id': 'country.4282270149587150',
     'short_code': 'ca',
     'wikidata': 'Q16',
     'text': 'Canada'}]},
  {'id': 'postcode.3502552036271570',
   'type': 'Feature',
   'place_type': ['postcode'],
   'relevance': 1,
   'properties': {},
   'text': 'M1E 3P5',
   'place_name': 'M1E 3P5, Toronto, Ontario, Canada',
   'center': [-79.17512, 43.76018],
   'geometry': {'type': 'Point', 'coordinates': [-79

In [137]:
# create test df from the 'features' key
test_df = pd.DataFrame(json['features'])
test_df = test_df[['center']]
test_df.head()

Unnamed: 0,center
0,"[-79.18226, 43.76364]"
1,"[-79.17512, 43.76018]"
2,"[-79.17836, 43.75973]"
3,"[-79.17038, 43.76958]"
4,"[-79.18371, 43.75097]"


In [138]:
# split the list contained in the center column into separate longitude and latitude columns
test_df['longitude'] = test_df['center'].apply(lambda x: x[0])
test_df['latitude'] = test_df['center'].apply(lambda x: x[1])
test_df

Unnamed: 0,center,longitude,latitude
0,"[-79.18226, 43.76364]",-79.18226,43.76364
1,"[-79.17512, 43.76018]",-79.17512,43.76018
2,"[-79.17836, 43.75973]",-79.17836,43.75973
3,"[-79.17038, 43.76958]",-79.17038,43.76958
4,"[-79.18371, 43.75097]",-79.18371,43.75097
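
As an alternative to two `apply` calls, the list column can be expanded into both columns at once. This is only a sketch on two of the centers shown above; `toy` and `coords` are made-up names.

```python
import pandas as pd

toy = pd.DataFrame({'center': [[-79.18226, 43.76364], [-79.17512, 43.76018]]})

# Build a two-column frame from the [longitude, latitude] pairs, keeping the
# original index so the join lines up row by row
coords = pd.DataFrame(
    toy['center'].tolist(),
    columns=['longitude', 'latitude'],
    index=toy.index,
)
toy = toy.join(coords)
print(toy)
```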


In [139]:
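# get the average longitude and latitude across the centers
# (select the numeric columns explicitly; 'center' holds lists and can't be averaged)
test_df[['longitude', 'latitude']].mean()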

longitude   -79.177966
latitude     43.760820
dtype: float64
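
The steps above (take each feature's center, split it, average) could be wrapped into one function that works directly on the parsed response. This is a sketch under assumptions: `mean_center` is a made-up name, and `sample` is a trimmed-down dict containing only the first two features shown earlier, so its average differs from the five-row result above.

```python
def mean_center(geojson):
    """Average the 'center' of every feature.

    Returns (latitude, longitude), or None if there are no features.
    """
    centers = [feature['center'] for feature in geojson.get('features', [])]
    if not centers:
        return None
    # Each center is listed as [longitude, latitude]
    mean_lon = sum(c[0] for c in centers) / len(centers)
    mean_lat = sum(c[1] for c in centers) / len(centers)
    return (mean_lat, mean_lon)

# Trimmed-down response with the first two features from the output above
sample = {
    'type': 'FeatureCollection',
    'features': [
        {'text': 'M1E 3H8', 'center': [-79.18226, 43.76364]},
        {'text': 'M1E 3P5', 'center': [-79.17512, 43.76018]},
    ],
}

print(mean_center(sample))
print(mean_center({'features': []}))
```

Returning `None` for an empty feature list keeps the caller from averaging over nothing when a post code has no matches.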