# Neighbourhoods in Toronto 1

## Importing postcode data from Wikipedia

### Task outline

Use your Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below:

<img src = "https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/7JXaz3NNEeiMwApe4i-fLg_40e690ae0e927abda2d4bde7d94ed133_Screen-Shot-2018-06-18-at-7.17.57-PM.png?expiry=1584835200000&hmac=84rAo3qtMUmsLzK61Xzy7ADTKueCCLuqudiQQ2-2Q-g" width = 400>

To create the above dataframe:

* The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
* Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
* More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
* If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
* Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
* In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.
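
The cleaning rules above can be sketched in pandas on a handful of made-up rows. This is only an illustration: the dataframe name `raw` and the sample values are assumptions, and it is not necessarily how the rest of this notebook goes about it.

```python
import pandas as pd

# Toy rows standing in for the scraped table (made-up sample, not the real data)
raw = pd.DataFrame(
    {
        "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
        "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
        "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
    }
)

# Rule 1: keep only rows with an assigned borough
assigned = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 2: an unassigned neighbourhood takes the borough's name
missing = assigned["Neighborhood"] == "Not assigned"
assigned.loc[missing, "Neighborhood"] = assigned.loc[missing, "Borough"]

# Rule 3: one row per postal code, neighbourhoods joined with a comma
combined = (
    assigned.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(combined)
```

Passing `sort=False` keeps the postal codes in the order they first appear rather than sorting them alphabetically.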

#### Update
For goodness' sake!  I spent *hours* figuring out exactly how to extract the tags from the table on the current Wikipedia page.  Then I saw that I can use a previous version of the page if I choose.  Someone suggested https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=945633050, which makes the whole thing trivial.
*Unimpressed*.

### Coding

In [1]:
#url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
url = 'https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=945633050'

I will be using the lxml parser.  It's probably installed on your system, but if not, uncomment the next cell:

In [2]:
#!pip3 install lxml

In [3]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
print('Loading url... ', end='')
html = requests.get(url).text
print('done.\nParsing markup...', end='')
parsed = BeautifulSoup(html, 'lxml')
print('done.')

Loading url... done.
Parsing markup...done.


In [4]:
print('Extracting information... ', end='')

# Find the table (there's only one so 'find' is good enough)
table = parsed.find('table',{'class':'wikitable sortable'})

# Make a collection of separate rows
rows = table.find_all('tr')

# Lists to hold all the data we want
pcodes = []
boroughs = []
neighbourhoods = []

for row in rows[1:]:
    # Each row has three cells
    postcode, borough, neighbourhood = row.find_all('td')
    
    # Get rid of the bumf in each tag
    postcode = postcode.string
    borough = borough.string
    # Some neighbourhoods come out with newlines attached.
    # .strings yields every text fragment in the cell, so taking [0]
    # works even when the cell holds just one fragment.
    neighbourhood = str(list(neighbourhood.strings)[0]).rstrip()
    
    #Skip the row if there is no borough
    if (borough != 'Not assigned'):
        #Assign neighbourhood the borough name if none is assigned
        if (neighbourhood == 'Not assigned'):
            neighbourhood = borough
        pcodes.append(postcode)
        boroughs.append(borough)
        neighbourhoods.append(neighbourhood)
print('done.')

Extracting information... done.
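
To see why the neighbourhood cell is read through `.strings` rather than `.string`, here is a small sketch on a made-up snippet imitating two cell shapes. The HTML, the helper name `first_text`, and the use of the built-in `html.parser` (instead of lxml, to avoid an extra dependency) are all assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up cells: one with plain text, one with a link followed by a newline
snippet = (
    "<table><tr>"
    "<td>Parkwoods\n</td>"
    '<td><a href="#">Regent Park</a>\n</td>'
    "</tr></table>"
)
plain, linked = BeautifulSoup(snippet, "html.parser").find_all("td")

# .string only works when the tag has exactly one child
print(repr(plain.string))   # the text, trailing newline included
print(repr(linked.string))  # two children (link + newline), so None


def first_text(cell):
    """Return the first text fragment of a cell without trailing whitespace."""
    return str(next(cell.strings)).rstrip()


print(first_text(plain), "|", first_text(linked))
```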


Now that we have the data in three lists, it is time to iterate over each postcode, extracting the borough and a list of all the neighbourhoods.  My structure is two dicts, each keyed by ```postcode```, whose values are ```borough``` and ```[neighbourhoods]``` respectively.
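
As a rough sketch of that two-dict structure on its own, using only plain Python and a few made-up rows (the names `borough_of`, `hoods_of` and `joined` are hypothetical and do not appear in the notebook's cells):

```python
# Made-up sample rows: (postcode, borough, neighbourhood)
rows = [
    ("M5A", "Downtown Toronto", "Harbourfront"),
    ("M5A", "Downtown Toronto", "Regent Park"),
    ("M3A", "North York", "Parkwoods"),
]

borough_of = {}  # postcode -> borough
hoods_of = {}    # postcode -> [neighbourhoods]

for postcode, borough, hood in rows:
    # Keep the first borough seen for a postcode
    borough_of.setdefault(postcode, borough)
    # Start an empty list on first sight, then append
    hoods_of.setdefault(postcode, []).append(hood)

# Comma-separated form, one string per postcode
joined = {code: ", ".join(hoods) for code, hoods in hoods_of.items()}
print(joined)
```

Because dicts preserve insertion order, the postcodes come out in the order they were first encountered.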

In [5]:
# This cell is not idempotent as I decided to reuse variable names.
# Don't like it?  Sue me.
codes = pd.DataFrame({'borough':boroughs, 'neighbourhood':neighbourhoods},
                     index = pcodes)
postcodes = list(dict.fromkeys(pcodes).keys())
neighbourhoods={} #I'll have one df where these are single strings
hoodlists={}  #And one where they are lists
boroughs = {}
postalcodes = {}
for code in postcodes:
    postalcodes[code] = code
    # .to_list() fails on a singleton neighbourhood (the lookup returns a
    # plain str, not a Series), so catch the AttributeError and handle it
    # as a single neighbourhood.  It's a little bit cleaner than a further
    # if statement.
    try: #Multiple neighbourhoods need Series -> list and one borough name
        hoodlists[code] = codes['neighbourhood'][code].to_list()
        neighbourhoods[code] = ', '.join(hoodlists[code])
        #The borough names will all be the same, so just choose the first
        boroughs[code] = codes['borough'][code].to_list()[0]
    except AttributeError: #Single neighbourhoods need item->[item] and the borough name
        hoodlists[code] = [codes['neighbourhood'][code]]
        neighbourhoods[code] = hoodlists[code][0]
        #Unlike above, here there will only be a single borough name
        boroughs[code] = codes['borough'][code]

#This is my final answer to Q1
final = pd.DataFrame({'PostalCode': postalcodes,
                      'Borough':boroughs,
                      'Neighbourhood':neighbourhoods})
#I may not need a list of neighbourhoods but it took no additional effort
final_lists = pd.DataFrame({'PostalCode': postalcodes, 
                          'Borough':boroughs,
                          'Neighbourhood':hoodlists})
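
The singleton case handled above comes from how pandas label lookup behaves on an index with repeated labels. A minimal sketch on made-up data (the helper `as_list` is hypothetical, shown as one possible alternative to catching the exception):

```python
import pandas as pd

# Made-up Series with one repeated and one unique index label
hoods = pd.Series(
    ["Harbourfront", "Regent Park", "Parkwoods"],
    index=["M5A", "M5A", "M3A"],
)

repeated = hoods["M5A"]  # label occurs twice -> a Series
single = hoods["M3A"]    # label occurs once  -> a plain str

print(type(repeated).__name__, type(single).__name__)


def as_list(value):
    """Normalise either lookup result to a list of names."""
    return value.to_list() if isinstance(value, pd.Series) else [value]


print(as_list(repeated), as_list(single))
```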

In [6]:
# If your screen is not wide enough, just reduce this
pd.set_option('max_colwidth', 150)
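
If changing the setting for the whole session is not wanted, pandas also offers `pd.option_context`, which restores the previous value on exit. A small sketch, using the full option name `display.max_colwidth`:

```python
import pandas as pd

before = pd.get_option("display.max_colwidth")

# Widen text columns only inside this block
with pd.option_context("display.max_colwidth", 150):
    inside = pd.get_option("display.max_colwidth")

after = pd.get_option("display.max_colwidth")
print(before, inside, after)
```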

In [7]:
# Reset the index to match the question format
# (named `shown` so IPython's built-in display() is not shadowed)
shown = final.reset_index(drop = True)
shown

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Downtown Toronto | Queen's Park |
| ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 99 | M4Y | Downtown Toronto | Church and Wellesley |
| 100 | M7Y | East Toronto | Business Reply Mail Processing Centre 969 Eastern |
| 101 | M8Y | Etobicoke | Humber Bay, King's Mill Park, Kingsway Park South East, Mimico NE, Old Mill South, The Queensway East, Royal York South East, Sunnylea |


In [8]:
print(f'The data frame has {final.shape[0]} rows.')

The data frame has 103 rows.
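
A side note on `.shape`: although the task outline calls it a method, it is an attribute holding a `(rows, columns)` tuple, so it is read without parentheses. A tiny sketch on a made-up dataframe:

```python
import pandas as pd

# Made-up two-row dataframe
toy = pd.DataFrame({"PostalCode": ["M3A", "M4A"], "Borough": ["North York"] * 2})

n_rows, n_cols = toy.shape  # attribute access, no call
print(n_rows, n_cols, len(toy))
```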
