## Task 1: Retrieve and Clean Data

Week 3: Applied Data Science Capstone

**GIVEN:**
A link to the Wikipedia page that lists the postal codes of Canada.

**GOAL:** Transform the data from the given Wikipedia page into a pandas DataFrame like the one shown below:

<img src="Example.png" width="500">

The transformed DataFrame must meet the following criteria:

* The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
* Only process the cells that have an assigned borough. Ignore cells with a borough that is **Not assigned**.
* More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that **M5A** is listed twice and has two neighborhoods: **Harbourfront** and **Regent Park**. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in **row 11** in the above table.
* If a cell has a borough but a **Not assigned** neighborhood, then the neighborhood will be the same as the borough.
* Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
* In the last cell of your notebook, use the **.shape** attribute to print the number of rows of your dataframe.
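
The three data rules above (drop unassigned boroughs, fall back to the borough name, merge duplicate postal codes) can be sketched on a small hand-made frame. The rows in `raw` are invented for illustration and are not taken from the Wikipedia table. The names `raw`, `assigned`, and `toy_df` are hypothetical and do not appear in the notebook cells below.

```python
import pandas as pd

# Toy rows standing in for the scraped table (made up for illustration).
raw = pd.DataFrame(
    {
        "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
        "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
        "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
    }
)

# Rule 1: keep only rows with an assigned borough.
assigned = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 2: an unassigned neighborhood falls back to the borough name.
unnamed = assigned["Neighborhood"] == "Not assigned"
assigned.loc[unnamed, "Neighborhood"] = assigned.loc[unnamed, "Borough"]

# Rule 3: merge rows that share a postal code into one comma-separated row.
toy_df = (
    assigned.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(toy_df)
```

In this sketch the two **M5A** rows collapse into a single row, and the **M7A** row takes its borough name as its neighborhood.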

### Download, Scrape and Wrangle Data from Wikipedia Page and Create a DataFrame

In [1]:
# import modules
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [2]:
# download data 
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
raw_data = requests.get(url).text
soup_raw_data = BeautifulSoup(raw_data, 'lxml')

In [3]:
# find table tag which contains all postal codes
table = soup_raw_data.find('table', {'class': 'wikitable'})

In [4]:
# extract data from table
table_rows = table.find_all('tr')

row_data = []
for tr in table_rows[1:]:
    td = tr.find_all('td')
    row_data.append([data.text.strip() for data in td])

In [5]:
# create a DataFrame
column_names = ['PostalCode', 'Borough', 'Neighborhood']
temp_df = pd.DataFrame(row_data, columns=column_names)
temp_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


### Data Cleaning

In [6]:
# remove rows where Borough is "Not assigned"; .copy() avoids a SettingWithCopyWarning on later column assignments
temp_df = temp_df[temp_df['Borough'] != 'Not assigned'].copy()
temp_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government


In [7]:
# Replace the ' / ' separator with ', '
temp_df['Neighborhood'] = temp_df['Neighborhood'].str.replace(' / ', ', ', regex=False)

# Combine Neighborhoods with same PostalCode and Borough
df = temp_df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge"
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"

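Before reporting the shape, it may help to confirm that the cleaned frame meets the criteria listed at the top. The helper below is a minimal sketch of such a check. `check_clean_frame` is a hypothetical name, and the two small frames are invented for illustration rather than taken from the scraped data.

```python
import pandas as pd

def check_clean_frame(frame):
    """Return a list of violated criteria (empty when every check passes)."""
    problems = []
    if list(frame.columns) != ["PostalCode", "Borough", "Neighborhood"]:
        # Without the expected columns the remaining checks cannot run.
        return ["unexpected columns"]
    if (frame["Borough"] == "Not assigned").any():
        problems.append("unassigned borough present")
    if (frame["Neighborhood"] == "Not assigned").any():
        problems.append("unassigned neighborhood present")
    if frame["PostalCode"].duplicated().any():
        problems.append("duplicate postal code")
    return problems

# A frame that satisfies the criteria.
good = pd.DataFrame(
    {
        "PostalCode": ["M1B", "M1C"],
        "Borough": ["Scarborough", "Scarborough"],
        "Neighborhood": ["Malvern, Rouge", "Rouge Hill, Port Union, Highland Creek"],
    }
)

# A frame with a repeated postal code and an unassigned neighborhood.
bad = pd.DataFrame(
    {
        "PostalCode": ["M5A", "M5A"],
        "Borough": ["Downtown Toronto", "Downtown Toronto"],
        "Neighborhood": ["Harbourfront", "Not assigned"],
    }
)

print(check_clean_frame(good))
print(check_clean_frame(bad))
```

Calling `check_clean_frame(df)` on the notebook's final frame would run the same checks against the real data.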

In [9]:
# print shape 
print(f"The Shape of DataFrame is : {df.shape}")

The Shape of DataFrame is : (103, 3)
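
Since the assignment asks for the number of rows, note that `shape` is an attribute holding a `(rows, columns)` tuple, so the row count can be read directly from it. The sketch below uses a small invented frame named `demo` (a hypothetical stand-in for `df`).

```python
import pandas as pd

# Small stand-in frame; the values are for illustration only.
demo = pd.DataFrame(
    {
        "PostalCode": ["M1B", "M1C"],
        "Borough": ["Scarborough", "Scarborough"],
        "Neighborhood": ["Malvern, Rouge", "Rouge Hill, Port Union, Highland Creek"],
    }
)

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = demo.shape
print(f"rows: {n_rows}, columns: {n_cols}")
```

Applied to the notebook's result, `df.shape[0]` would give the 103 rows reported above.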
