## Week 3 - Assignment - Create DataFrame

### Neighbourhoods in Toronto

<b>Import the required libraries</b>

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

<b>Get web page containing data</b>

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'  # URL containing table
page = requests.get(url)  # pull page
page.raise_for_status()  # stop here if the request failed (4xx/5xx response)

<b>Parse web page for table</b>

In [3]:
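soup = BeautifulSoup(page.content, 'html.parser')  # name the parser explicitly so results do not depend on which parsers are installed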

table = soup.find('table',{'class':'wikitable sortable'})  # find table in page

column_names = [header.text.strip() for header in table.find_all('th')]  # find column names, trimming the trailing newline

row_data = table.find_all('tr')  # find rows
rows = list()
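for row in row_data:
    cells = row.find_all('td')  # the header row has only <th> cells, so it yields an empty list
    value = [cell.text.strip() for cell in cells]  # trim surrounding whitespace and the trailing newline
    rows.append(value)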
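
The parsing logic can be checked offline on a small table. The sketch below is only an illustration: `sample_html` is a hypothetical three-row stand-in for the Wikipedia table, not a copy of it, and the `sample_` names are not part of the notebook. It shows that the header row produces an empty list of `td` cells, which is why the first parsed row has to be skipped later.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature table standing in for the Wikipedia page (assumption).
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postal Code
</th><th>Borough
</th><th>Neighbourhood
</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", {"class": "wikitable sortable"})

# Column names come from the <th> cells; strip=True trims the trailing newline.
sample_columns = [th.get_text(strip=True) for th in sample_table.find_all("th")]

# One list per <tr>; the header row contains no <td>, so its list is empty.
sample_rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in sample_table.find_all("tr")
]

print(sample_columns)
print(sample_rows)
```

Using `get_text(strip=True)` here instead of slicing off the last character means a cell without a trailing newline keeps its final letter.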

<b>Convert table data into a DataFrame</b>

In [4]:
# rows[0] comes from the header row, which has no <td> cells, so skip it
table_data = pd.DataFrame(rows[1:], columns=column_names)
table_data.tail()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 175 | M5Z | Not assigned | Not assigned |
| 176 | M6Z | Not assigned | Not assigned |
| 177 | M7Z | Not assigned | Not assigned |
| 178 | M8Z | Etobicoke | Mimico NW, The Queensway West, South of Bloor,... |
| 179 | M9Z | Not assigned | Not assigned |


<b>Clean and process the DataFrame</b>

In [5]:
# Step 1: Drop rows whose Borough is not assigned
mask = table_data['Borough'] == 'Not assigned'
step1 = table_data[~mask]

# Step 2: Split the comma-separated Neighbourhood column into one column per name
# (splitting on r',\s*' also removes the space that follows each comma)
split_names = step1['Neighbourhood'].str.split(r',\s*', expand=True)
step2 = pd.concat([step1.drop(columns='Neighbourhood'), split_names], axis=1)

# Step 3: Melt split columns 0-7 so each neighbourhood gets its own row with its Postal Code and Borough
step3 = step2.melt(id_vars=['Postal Code', 'Borough'], value_name='Neighbourhood', value_vars=[0, 1, 2, 3, 4, 5, 6, 7])
step3 = step3.drop(columns='variable')

# Step 4: Drop the empty rows left by postal codes with fewer neighbourhoods
df = step3.dropna().reset_index(drop=True)
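
The split, melt, and drop steps above can also be written with `DataFrame.explode`, which avoids listing the split column numbers by hand. This is a sketch of an alternative, not the method used in this notebook; `sample` is a hypothetical three-row frame and `tidy` is an illustrative name.

```python
import pandas as pd

# Hypothetical sample mirroring the scraped table's columns (assumption).
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Regent Park, Harbourfront"],
})

tidy = (
    sample[sample["Borough"] != "Not assigned"]          # drop unassigned boroughs
    .assign(Neighbourhood=lambda d: d["Neighbourhood"].str.split(","))  # lists of names
    .explode("Neighbourhood")                            # one row per list element
    .assign(Neighbourhood=lambda d: d["Neighbourhood"].str.strip())     # trim spaces
    .reset_index(drop=True)
)

print(tidy)
```

One difference worth noting: `explode` keeps the neighbourhoods of a postal code on adjacent rows, whereas `melt` stacks all first names, then all second names, and so on, so the row order of the two results differs even when the contents match.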

In [6]:
df.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park |
| 3 | M6A | North York | Lawrence Manor |
| 4 | M7A | Downtown Toronto | Queen's Park |


In [7]:
print(f'The table has {df.shape[0]} rows and {df.shape[1]} columns')

The table has 217 rows and 3 columns
