# Loading Toronto neighborhoods into a dataframe

by _Marc Behrens_

In [1]:
import pandas as pd
import numpy as np

## Importing Table from Wikipedia site

In [2]:
# Read the first table from the Wikipedia page.
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]

In [3]:
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [4]:
# Let's have a look at the size right after importing
df.shape

(180, 3)

So, right after importing, our dataframe has 180 rows.

## Delete rows without borough
One condition is "Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned."
So, we have to drop the rows where Borough equals "Not assigned".

In [5]:
# Drop rows where Borough equals "Not assigned"
df.drop(df[(df.Borough == 'Not assigned')].index, inplace=True)

In [6]:
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [7]:
# Let's see how many rows are left
df.shape

(103, 3)

So, after deleting the rows without borough, we have 103 rows left.
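
The same filter can also be written as a boolean mask, which avoids `inplace=True` and makes it easy to renumber the index afterwards. The following is only a minimal sketch: the `sample` frame is a hypothetical stand-in for the Wikipedia table, not data read from the page.

```python
import pandas as pd

# Hypothetical sample mimicking the structure of the imported table
sample = pd.DataFrame({
    'Postal Code': ['M1A', 'M2A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep only the rows whose borough is assigned
filtered = sample[sample['Borough'] != 'Not assigned']

# Optionally renumber the index from 0 after filtering
filtered = filtered.reset_index(drop=True)

print(filtered.shape)
```

Whether to reset the index is a matter of taste; the notebook above keeps the original index, which is why `df.head()` starts at 2.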

## Group rows with duplicated Postal Code
Condition "More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table."
So, we have to find the postal codes that appear in more than one row.

In [8]:
# Check if there are any duplicate Postal Codes
df.groupby('Postal Code').filter(lambda x: len(x) > 1)

Unnamed: 0,Postal Code,Borough,Neighbourhood


We find that there aren't any.

In [9]:
# Check with the postal code stated in the task: M5A
df[df['Postal Code'] == 'M5A']

Unnamed: 0,Postal Code,Borough,Neighbourhood
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


It seems this work has already been done by the maintainers of the table.
So, there is nothing left to do here.
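
If the table did contain one row per neighbourhood, the rows could be combined with a `groupby`. This is a sketch under that assumption only: the `sample` frame below is hypothetical and imitates the layout described in the task, with M5A listed twice.

```python
import pandas as pd

# Hypothetical sample where M5A appears on two rows (not the current Wikipedia layout)
sample = pd.DataFrame({
    'Postal Code': ['M4A', 'M5A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto', 'Downtown Toronto'],
    'Neighbourhood': ['Victoria Village', 'Regent Park', 'Harbourfront'],
})

# Group by postal code and borough, then join the neighbourhood names with a comma
combined = (
    sample
    .groupby(['Postal Code', 'Borough'], as_index=False, sort=False)
    .agg({'Neighbourhood': ', '.join})
)

print(combined)
```

Grouping on both `Postal Code` and `Borough` keeps the borough column in the result without needing a separate aggregation for it.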

## Assign Borough Name to Neighbourhoods without Name
Condition "If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough."
We will have to find the rows with borough but without neighbourhood.

In [10]:
# Find rows with Borough assigned but neighbourhood unassigned.
df[(df.Borough != 'Not assigned') & (df.Neighbourhood == 'Not assigned')]

Unnamed: 0,Postal Code,Borough,Neighbourhood


We find that there are none. So, we don't have to do anything.

In [11]:
# Check if there are already neighbourhoods that carry their borough's name
df[(df.Borough == df.Neighbourhood)]

Unnamed: 0,Postal Code,Borough,Neighbourhood


There are none.
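
Had such rows existed, the neighbourhood could be filled in from the borough with a boolean mask and `.loc`. A minimal sketch, assuming a hypothetical `sample` frame with one unassigned neighbourhood (the row is invented for illustration):

```python
import pandas as pd

# Hypothetical sample with one borough that has no neighbourhood assigned
sample = pd.DataFrame({
    'Postal Code': ['M7A', 'M3A'],
    'Borough': ["Queen's Park", 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods'],
})

# Rows that have a borough but no neighbourhood
missing = (sample['Borough'] != 'Not assigned') & (sample['Neighbourhood'] == 'Not assigned')

# Copy the borough name into the neighbourhood column for those rows only
sample.loc[missing, 'Neighbourhood'] = sample.loc[missing, 'Borough']

print(sample)
```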

## Check the final shape
Condition "In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe."

In [12]:
# Check shape
df.shape

(103, 3)

So, there are 103 rows in the end.