<h1 align=center><font size = 5>Structuring Data of Toronto Neighborhoods</font></h1>

This notebook scrapes the table of postal codes from the Wikipedia page on Toronto postal codes (https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) and transforms it into a pandas dataframe.

First, let's import the pandas library.

In [1]:
import pandas as pd

### Reading the data from Wikipedia

Let's start by reading the table from the Wikipedia page.

In [2]:
# Note: the read_html command used below requires lxml. If it is not installed, run the following command first:
# !conda install lxml --yes

list_df = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")

This creates a list of dataframes, one for each table on the Wikipedia page, in order of appearance. Since we are only interested in the table with information on neighborhoods, which is the first table on the page, let's create a dataframe from the first element of the list.

In [3]:
df = list_df[0]

# Renaming the columns as indicated in the assignment's instructions
df.rename(columns={'Postcode':'PostalCode', 'Neighbourhood':'Neighborhood'}, inplace=True)

df.head()

,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
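
Taking the first element of the list works only as long as the postal code table stays first on the page. As a minimal sketch of a more defensive alternative, the table could be selected by its column names instead of its position. The helper `pick_table` and the toy frames below are hypothetical and stand in for the list returned by `pd.read_html`.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first dataframe that has every required column (hypothetical helper)."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f"No table has the columns {sorted(required)}")


# Toy stand-ins for the list of dataframes returned by pd.read_html
tables = [
    pd.DataFrame({"Legend": ["a", "b"]}),
    pd.DataFrame({
        "Postcode": ["M1A", "M3A"],
        "Borough": ["Not assigned", "North York"],
        "Neighbourhood": ["Not assigned", "Parkwoods"],
    }),
]

postal_table = pick_table(tables, ["Postcode", "Borough", "Neighbourhood"])
print(postal_table)
```

With this approach the selection fails loudly with a `ValueError` if the page layout changes, rather than silently returning the wrong table.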


### Cleaning the data

Now we clean the data in this dataframe, following the assignment's instructions. First, we deal with the cells containing 'Not assigned'.

In [4]:
# Drop rows whose borough is listed as 'Not assigned'
df = df[df['Borough'] != 'Not assigned'].copy()

# If a borough is assigned but the neighborhood is not, then the neighborhood will be the same as the borough.
# Using .loc with a boolean mask avoids chained assignment, which pandas may apply to a temporary copy.
unassigned = df['Neighborhood'] == 'Not assigned'
df.loc[unassigned, 'Neighborhood'] = df.loc[unassigned, 'Borough']
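
To see the two cleaning rules in isolation, here is a minimal sketch on a small made-up frame, using boolean masks rather than modifying the dataframe row by row. The rows in `toy` are invented for illustration and are not taken from the scraped table.

```python
import pandas as pd

# Made-up rows covering the three cases: no borough, fully assigned, borough only
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Harbourfront", "Not assigned"],
})

# Rule 1: keep only the rows that have a borough
cleaned = toy[toy["Borough"] != "Not assigned"].copy()

# Rule 2: where the neighborhood is missing, fall back to the borough name
missing = cleaned["Neighborhood"] == "Not assigned"
cleaned.loc[missing, "Neighborhood"] = cleaned.loc[missing, "Borough"]

print(cleaned)
```

The first row is dropped, the second is left unchanged, and the third has its neighborhood filled in with "Queen's Park".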

Finally, we combine rows that share a postal code, joining their neighborhoods into a single comma-separated string.

In [5]:
df_toronto = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()

# Let's save this dataframe into a csv file
df_toronto.to_csv('structured_toronto_neighborhoods.csv', index = False)

df_toronto.head()

,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
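
The effect of the grouping step is easier to see on a few rows. The sketch below assumes the cleaned data has one row per neighborhood, which is consistent with the combined output above, and uses `agg` with `as_index=False` as an equivalent alternative to `apply(...).reset_index()`.

```python
import pandas as pd

# One row per neighborhood, as assumed for the cleaned table
toy = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge", "Malvern", "Woburn"],
})

# Group by postal code and borough, then join the neighborhood names of each group
combined = toy.groupby(["PostalCode", "Borough"], as_index=False)["Neighborhood"].agg(", ".join)

print(combined)
```

The two M1B rows collapse into a single row with the neighborhood "Rouge, Malvern", while M1G keeps its single neighborhood.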


The final shape of the dataframe is:

In [6]:
df_toronto.shape

(103, 3)
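
Beyond the shape, a couple of quick checks can confirm that the cleaning rules held. The following is a sketch only: `find_problems` is a hypothetical helper, and `sample` reuses a few rows from the output shown earlier in place of the full `df_toronto`.

```python
import pandas as pd


def find_problems(frame):
    """Return a list of problems found in a structured frame (hypothetical helper)."""
    problems = []
    # Each postal code should appear exactly once after grouping
    if frame["PostalCode"].duplicated().any():
        problems.append("duplicate postal codes")
    # No 'Not assigned' placeholder should survive the cleaning step
    leftovers = frame[["Borough", "Neighborhood"]].eq("Not assigned")
    if leftovers.any().any():
        problems.append("'Not assigned' values remain")
    return problems


sample = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge, Malvern", "Highland Creek, Rouge Hill, Port Union", "Woburn"],
})

print(find_problems(sample))  # prints []
```

Calling `find_problems(df_toronto)` in the notebook would run the same checks on the full dataframe.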