<h2>Part 1: Create Dataframe</h2>

This part builds a dataframe from the table at *https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M*
and shows the shape of the dataframe at the end. A few Python libraries need to be imported to get there. Two methods were suggested for reading the page: *pandas* and *BeautifulSoup*. We will use *pandas*.

Start by importing the Python libraries that will be used.

In [1]:
import pandas as pd
import requests
import numpy as np

We know the URL: *https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M*. With that, we send a GET request to fetch the HTML page. The <Response [200]> output shows that the request succeeded (HTTP status 200).

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
n_url = requests.get(url)
n_url

<Response [200]>

With the HTML page retrieved, we can now get the data tables from it using *read_html* from *pandas*. This returns a list of dataframes, one for each table found on the page.

In [3]:
list_tables = pd.read_html(n_url.text)
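Because *read_html* returns a plain Python list of dataframes, the table we want could also be picked by its column names rather than by its position in the list. The following is only a minimal sketch of that idea: the helper name *find_table* and the small stand-in dataframes are hypothetical and are not part of this notebook.

```python
import pandas as pd

def find_table(tables, required_columns):
    """Return the first dataframe whose columns include all required_columns."""
    for table in tables:
        if set(required_columns).issubset(table.columns):
            return table
    raise ValueError("no table has the required columns")

# Stand-ins for the list that pd.read_html would return (hypothetical data).
tables = [
    pd.DataFrame({"Other": [1, 2]}),
    pd.DataFrame({
        "Postcode": ["M1A", "M3A"],
        "Borough": ["Not assigned", "North York"],
        "Neighbourhood": ["Not assigned", "Parkwoods"],
    }),
]

postal = find_table(tables, ["Postcode", "Borough", "Neighbourhood"])
print(list(postal.columns))
```

Selecting by column names keeps working if the page gains or reorders tables, at the cost of failing if the column headers themselves are renamed.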

Looking at *https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M*, we can see that the page has several tables, so the returned list has several entries. We only want one of them: the first. We want every entry to have a neighbourhood, so any entry whose neighbourhood is 'Not assigned' gets the name of its borough instead. We also don't want entries that have no borough at all, so we keep only the entries that have a borough assigned.

In [4]:
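postal_table = list_tables[0]

# Give rows without a neighbourhood the name of their borough.
# range(len(...)) covers every row, and .loc avoids chained assignment.
for i in range(len(postal_table)):
    if postal_table.loc[i, 'Neighbourhood'] == 'Not assigned':
        postal_table.loc[i, 'Neighbourhood'] = postal_table.loc[i, 'Borough']

# Keep only the rows that have a borough assigned.
postal_table = postal_table[postal_table.Borough != 'Not assigned']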
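The same two cleaning steps could also be written without an explicit loop. This is a sketch of one possible alternative, not the method used in this notebook; the three rows below are illustrative stand-ins for the scraped table.

```python
import pandas as pd

# Small stand-in for the scraped table (illustrative rows only).
df = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Where the neighbourhood is missing, take the borough name instead.
missing = df["Neighbourhood"] == "Not assigned"
df["Neighbourhood"] = df["Neighbourhood"].mask(missing, df["Borough"])

# Keep only the rows that have a borough; copy() makes the result
# independent of df so later assignments do not raise warnings.
cleaned = df[df["Borough"] != "Not assigned"].copy()
print(cleaned)
```

Here *Series.mask* replaces values where the condition is true, so the whole column is updated in one step.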

We now have the data we want, but we removed some rows to get to this point. If we check the first rows, we can see that the index no longer starts at 0.

In [5]:
postal_table.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


Reset the index so we can start from 0 and also remove the gaps in the index.

In [6]:
postal_table_indexed = postal_table.reset_index(drop=True)
postal_table_indexed.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
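To make the effect of *drop=True* concrete, here is a small sketch on a made-up three-row dataframe (the data is illustrative, not taken from the page). Without *drop=True*, *reset_index* keeps the old index as an extra column.

```python
import pandas as pd

df = pd.DataFrame({"Postcode": ["M1A", "M3A", "M4A"]})

# Filtering removes the first row, so the remaining index labels are 1 and 2.
filtered = df[df["Postcode"] != "M1A"]

# Default: the old index is kept as a new column named "index".
kept = filtered.reset_index()

# drop=True: the old index is discarded and a fresh 0-based index is used.
dropped = filtered.reset_index(drop=True)

print(list(kept.columns))
print(list(dropped.columns))
print(list(dropped.index))
```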


Our dataframe now has no unassigned neighbourhoods, no unassigned boroughs, and a continuous index.

The next step, sorting, may be unnecessary depending on how we go about combining the neighbourhoods that share a "Postcode". With the approach used here it is necessary to sort the data by the "Postcode" column, because the loop compares each row with the row after it, so rows with the same postcode must be adjacent.

Once the data is sorted into *postal_table_sorted*, we can combine the neighbourhoods that share the same postcode. The loop drops rows, so the index is reset again afterwards. Renaming the result is not required, but we do it here for clarity.

In [7]:
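# Sort by postcode with a stable sort (keeps the original order within each postcode),
# then reset the index so that labels i and i+1 really are neighbouring rows.
postal_table_sorted = postal_table_indexed.sort_values(by=['Postcode'], kind='mergesort').reset_index(drop=True)

for i in range(len(postal_table_sorted)-1):
    if postal_table_sorted.loc[i, 'Postcode'] == postal_table_sorted.loc[i+1, 'Postcode']:
        # Carry the accumulated neighbourhoods forward to the next row, then drop this row.
        postal_table_sorted.loc[i+1, 'Neighbourhood'] = (postal_table_sorted.loc[i, 'Neighbourhood'] + ', ' + postal_table_sorted.loc[i+1, 'Neighbourhood'])
        postal_table_sorted.drop([i], inplace=True)

# Dropping rows leaves gaps in the index, so reset it once more.
postal_table_combined = postal_table_sorted.reset_index(drop=True)
postal_table_combined.head()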

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
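The loop merges rows pairwise. Grouping would be another way to reach the same kind of result, and it would not depend on the rows being sorted first. The sketch below is an alternative under that assumption, not the method used in this notebook, and its five rows are a small stand-in built from values shown in the outputs above.

```python
import pandas as pd

# Unsorted stand-in rows; postcodes M6A and M1B each appear twice.
df = pd.DataFrame({
    "Postcode": ["M6A", "M1B", "M6A", "M1B", "M3A"],
    "Borough": ["North York", "Scarborough", "North York", "Scarborough", "North York"],
    "Neighbourhood": ["Lawrence Heights", "Rouge", "Lawrence Manor", "Malvern", "Parkwoods"],
})

# Group rows that share a postcode and borough, then join their neighbourhoods.
# groupby sorts by the group keys by default and keeps row order inside each group.
combined = (
    df.groupby(["Postcode", "Borough"])["Neighbourhood"]
      .agg(", ".join)
      .reset_index()
)
print(combined)
```

The final *reset_index* turns the group keys back into ordinary columns, which gives the same three-column layout as *postal_table_combined*.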


Our data is now cleaned and sorted. Last, we use *numpy* to get the shape of the dataframe: 103 rows and 3 columns.

In [8]:
np.shape(postal_table_combined)

(103, 3)
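The same (rows, columns) tuple is also available directly from the dataframe through its *shape* attribute. A minimal sketch on a two-row stand-in dataframe (illustrative data only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge, Malvern", "Highland Creek, Rouge Hill, Port Union"],
})

# DataFrame.shape and np.shape report the same (rows, columns) tuple.
rows, columns = df.shape
print(rows, columns)
print(np.shape(df) == df.shape)
```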