# Scraping the Toronto Postal Codes

This project requires a complete list of Toronto postal codes, which we will scrape from [this Wikipedia article](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).

## Goal

1. Identify the format of the table on Wikipedia
2. Scrape the table using BeautifulSoup
3. Organize the table into a pandas DataFrame
4. Save the DataFrame as a CSV file


In [1]:
# Import required libraries

import requests
from bs4 import BeautifulSoup
import pandas as pd 
import numpy as np

## Part 1, Identify table structure

If one inspects the table on Wikipedia, it appears to be of the following format:

```html
<table>
    <thead> 
        <tr> ... <th>column names</th> ... </tr>
    </thead>
    <tbody> 
        ...
        <tr> ... <td>column values</td> ...</tr>
        ....
    </tbody>
</table>

```

This structure is simple enough to parse row by row.  Furthermore, a quick search in the developer console confirms that the table we want is the first table on the page.
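To illustrate row-by-row parsing of this layout, here is a minimal sketch run against a small inline HTML string rather than the live page. `SAMPLE_HTML` is a hypothetical miniature written for this example, and the built-in `html.parser` backend is used only so the sketch needs no extra parser installed.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the layout described above (not the real page).
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
    <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
  </tbody>
</table>
<table><tr><td>some other table</td></tr></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first match in document order, i.e. the first table.
first_table = soup.find("table")

rows = []
for tr in first_table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row holds only <th> cells, so it yields no <td>
        rows.append(cells)

print(rows)
```

Skipping rows that contain no `<td>` cells is one way to ignore the header row without relying on its position.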


## Part 2, Scrape the table 

In [9]:
# get text of html page 

URL = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page_html = requests.get(URL).text

# convert to soup 

soup = BeautifulSoup(page_html, 'lxml')

# get the body of the first table on the page (find() returns the first match)

table = soup.find('tbody')


In [95]:
# print to confirm acquisition

print(type(table))

<class 'bs4.element.Tag'>
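The request in this part assumes the download succeeds. A slightly more defensive fetch might look like the sketch below. The helper name `fetch_html`, the injectable `get` parameter, and the 10-second timeout are assumptions made for this illustration. `timeout` and `raise_for_status()` are standard parts of the `requests` API.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the page text, raising an exception on HTTP errors.

    `get` is injectable so the function can be exercised without network access.
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text
```

With this helper, the cell above could call `fetch_html(URL)` in place of `requests.get(URL).text`, so that a failed download stops the notebook instead of silently producing an empty table.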


## Part 3, Organize into a data frame 


In [89]:
# Set up temporary arrays to hold values 

codes = []
boroughs = []
neighborhoods = []

# Iterate through table rows

for row in table.find_all('tr'):
    # Assume following structure,
    #   0 - code
    #   1 - borough
    #   2 - neighborhood 
    tds = row.find_all('td')
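    # Rows with fewer than three <td> cells (e.g. the header row) get None
    try:
        codes.append(tds[0].text)
    except IndexError:
        codes.append(None)
    try:
        boroughs.append(tds[1].text)
    except IndexError:
        boroughs.append(None)
    try:
        neighborhoods.append(tds[2].text)
    except IndexError:
        neighborhoods.append(None)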

# Create data frame 

codes_df = pd.DataFrame({
    "code": codes,
    "borough":boroughs,
    "neighborhood": neighborhoods
})

# Remove first row and reset index

codes_df.drop(0, inplace=True)
codes_df.reset_index(drop=True, inplace=True)

# Remove \n from each field

def remove_newline(s):
    return s.replace('\n', '')

codes_df['code'] = list(map(remove_newline, codes_df['code']))
codes_df['borough'] = list(map(remove_newline, codes_df['borough']))
codes_df['neighborhood'] = list(map(remove_newline, codes_df['neighborhood']))

# Drop rows without an assigned borough

index = codes_df['borough'] != "Not assigned"
codes_df = codes_df[index]

codes_df.reset_index(drop=True, inplace=True)

codes_df

|     | code | borough | neighborhood |
| --- | --- | --- | --- |
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 3 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government |
| ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway / Montgomery Road / Old Mill North |
| 99 | M4Y | Downtown Toronto | Church and Wellesley |
| 100 | M7Y | East Toronto | Business reply mail Processing Centre |
| 101 | M8Y | Etobicoke | Old Mill South / King's Mill Park / Sunnylea /... |
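The cleaning steps in this part can also be written with pandas string methods. The sketch below uses a few illustrative rows (the first row is made up to show the filter) and is an alternative, not the code the notebook ran. Note one difference: `str.strip()` removes whitespace only at the ends of each value, whereas `remove_newline` removes every `\n` in the string.

```python
import pandas as pd

# Illustrative rows only; the trailing "\n" mimics raw cell text from the scrape.
raw_df = pd.DataFrame({
    "code": ["M1A\n", "M3A\n", "M4A\n"],
    "borough": ["Not assigned\n", "North York\n", "North York\n"],
    "neighborhood": ["Not assigned\n", "Parkwoods\n", "Victoria Village\n"],
})

# Strip leading/trailing whitespace (including newlines) from every column.
clean_df = raw_df.apply(lambda col: col.str.strip())

# Keep only rows with an assigned borough, then renumber the index from 0.
clean_df = clean_df[clean_df["borough"] != "Not assigned"].reset_index(drop=True)

print(clean_df)
```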


### Note

The assignment on Coursera's website lists the following instructions:

* Combine duplicated postal codes
* Assign borough name to unassigned neighborhoods

The data frame above contains neither duplicated postal codes nor unassigned neighborhoods, as the checks below confirm.

In [84]:
# Duplicated postal code?

unique_codes = len(codes_df['code'].unique())
total_codes = len(codes_df['code'])

print(f"There are {total_codes - unique_codes} duplicated codes.")

# Rows without neighborhood names

no_neigh = sum(codes_df['neighborhood'] == "Not assigned")
print(f"There are {no_neigh} instances without neighborhood names.")


There are 0 duplicated codes.
There are 0 instances without neighborhood names.
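Had either case been present, the two instructions could be handled roughly as in the sketch below. The rows are hypothetical and exist only to trigger both cases, and the `", "` separator is an assumption, since the instructions as quoted here do not specify one.

```python
import pandas as pd

# Hypothetical rows: one duplicated postal code and one unassigned neighborhood.
sample = pd.DataFrame({
    "code": ["M5A", "M5A", "M7A"],
    "borough": ["Downtown Toronto", "Downtown Toronto", "Downtown Toronto"],
    "neighborhood": ["Regent Park", "Harbourfront", "Not assigned"],
})

# An unassigned neighborhood takes the name of its borough.
unassigned = sample["neighborhood"] == "Not assigned"
sample.loc[unassigned, "neighborhood"] = sample.loc[unassigned, "borough"]

# Rows sharing a postal code collapse into one row,
# with their neighborhoods joined by ", " (separator is an assumption).
combined = (
    sample.groupby(["code", "borough"], as_index=False, sort=False)["neighborhood"]
    .agg(", ".join)
)

print(combined)
```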


In [90]:
# Rename columns to match rubric

codes_df.columns = ["PostalCode", "Borough", "Neighborhood"]
codes_df

|     | PostalCode | Borough | Neighborhood |
| --- | --- | --- | --- |
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 3 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government |
| ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway / Montgomery Road / Old Mill North |
| 99 | M4Y | Downtown Toronto | Church and Wellesley |
| 100 | M7Y | East Toronto | Business reply mail Processing Centre |
| 101 | M8Y | Etobicoke | Old Mill South / King's Mill Park / Sunnylea /... |


In [92]:
# Save the dataframe to a file 

codes_df.to_csv('toronto_data.csv')
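By default, `to_csv` also writes the index as an unnamed first column, which reappears as `Unnamed: 0` when the file is read back with `read_csv`. The sketch below shows the difference that `index=False` makes. It uses a two-row stand-in frame and in-memory buffers instead of files, both chosen only for this illustration.

```python
import io

import pandas as pd

# Two-row stand-in for the postal code data frame.
df = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village"],
})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the three data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Whether to keep the index depends on how the file will be consumed. If it is kept, reading with `pd.read_csv(..., index_col=0)` restores it as the index rather than as a data column.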

In [93]:
# Print the shape of the dataframe 

codes_df.shape

(103, 3)