# Exploring Toronto
## Part 1 - Scraping Wikipedia and Creating the dataframe

In [1]:
# Install beautifulsoup4, lxml (the parser), and requests; uncomment and run if needed
#!pip install beautifulsoup4
#!pip install lxml
#!pip install requests

In [2]:
#import libs
import pandas as pd
import requests
from bs4 import BeautifulSoup

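# Download the Wikipedia page and parse it with the lxml parser
page = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(page.content, 'lxml')

# Take the first table on the page and collect its rows
table = soup.find_all('table')[0]
table_rows = table.find_all('tr')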

# Build a list of rows, each row being a list of cell texts
data = []
for tr in table_rows:
    cells = tr.find_all('td')
    # Keep the stripped text of every non-empty cell
    row = [td.text.strip() for td in cells if td.text.strip()]
    if row:  # skip rows with no <td> cells (for example, a header row)
        data.append(row)

# Create the dataframe from the list of rows
df = pd.DataFrame(data, columns=["PostalCode", "Borough", "Neighbourhood"])
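
To see what the row-extraction loop produces without fetching the live page, the sketch below runs the same idea on a small inline HTML snippet. The names `sample_html`, `sample_rows`, and `sample_df` are hypothetical, the snippet only imitates the shape of the Wikipedia table, and `html.parser` is used instead of `lxml` just to avoid an extra dependency.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table, so the sketch runs offline.
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_rows = []
for tr in sample_soup.find("table").find_all("tr"):
    # Collect the stripped text of each <td> cell in this row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so it is skipped
        sample_rows.append(cells)

sample_df = pd.DataFrame(
    sample_rows, columns=["PostalCode", "Borough", "Neighbourhood"]
)
print(sample_df)
```

The header row contributes nothing because it contains no `<td>` cells, which is why the column names are supplied explicitly when the dataframe is built.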

### The initial dataframe
The dataframe will consist of three columns: PostalCode, Borough, and Neighbourhood.

In [3]:
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


### Removing 'Not assigned' Boroughs
Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [4]:
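# Keep only the rows whose Borough is assigned; .copy() makes the result an
# independent dataframe, so later assignments do not trigger SettingWithCopyWarning
df = df[df.Borough != 'Not assigned'].copy()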
df.head(8)

Unnamed: 0,PostalCode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
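
The output above jumps from index 8 to index 10 because boolean filtering keeps the original row labels. The sketch below shows this on a hypothetical toy frame (`toy`, `kept`, and `renumbered` are illustrative names, and the rows are made up for the example), along with `reset_index(drop=True)` as one optional way to renumber the rows.

```python
import pandas as pd

# Hypothetical toy data in the same shape as the scraped table
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M8A", "M9A"],
    "Borough": ["Not assigned", "North York", "Not assigned", "Etobicoke"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned", "Islington Avenue"],
})

# Boolean mask: True where the borough is assigned
kept = toy[toy["Borough"] != "Not assigned"].copy()
print(kept.index.tolist())        # original labels survive the filter

# Optional: renumber the rows from 0 and discard the old labels
renumbered = kept.reset_index(drop=True)
print(renumbered.index.tolist())
```

The notebook keeps the original labels, which is harmless here because the later steps do not rely on the row index.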


### Replacing 'Not assigned' Neighbourhood with Borough
If a row has a borough but its neighbourhood is 'Not assigned', the neighbourhood is set to the same value as the borough.

In [5]:
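# Replace Neighbourhood with Borough where Neighbourhood is 'Not assigned'.
# Writing to the rows yielded by iterrows() is not guaranteed to update df,
# so assign through a boolean mask with .loc instead.
unassigned = df['Neighbourhood'] == 'Not assigned'
df.loc[unassigned, 'Neighbourhood'] = df.loc[unassigned, 'Borough']
df.head(8)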

Unnamed: 0,PostalCode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
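
The same rule can also be written without a loop using `Series.mask`, which replaces values where a condition holds. This is a minimal sketch on a hypothetical toy frame (`toy` and `unassigned` are illustrative names), not the notebook's own data, and it ends with a quick check that no 'Not assigned' neighbourhood remains.

```python
import pandas as pd

# Hypothetical toy data: one row with a missing neighbourhood
toy = pd.DataFrame({
    "Borough": ["Queen's Park", "Etobicoke"],
    "Neighbourhood": ["Not assigned", "Islington Avenue"],
})

# True where the neighbourhood still needs a value
unassigned = toy["Neighbourhood"] == "Not assigned"

# Where the condition is True, take the value from Borough instead
toy["Neighbourhood"] = toy["Neighbourhood"].mask(unassigned, toy["Borough"])

# Sanity check: nothing is left unassigned
print((toy["Neighbourhood"] == "Not assigned").any())
```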


### Combine Neighbourhoods with Common Postal Codes
More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma.

In [6]:
# Set PostalCode and Borough as the index
df.set_index(['PostalCode', 'Borough'], inplace=True)

# Combine rows that share an index entry into one comma-separated string
df = df.groupby(level=['PostalCode', 'Borough'], sort=False).agg(','.join)

# Reset the index; reset_index() returns a new dataframe, so assign the result back
df = df.reset_index()

df.head(8)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge,Malvern"
7,M3B,North York,Don Mills North


In [7]:
df.shape

(103, 3)
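
As an alternative to setting and resetting the index, the grouping can be done directly on the columns with `as_index=False`, which keeps PostalCode and Borough as ordinary columns. The sketch below uses a hypothetical toy frame (`toy` and `combined` are illustrative names) and is one possible variant, not the approach taken in the cell above.

```python
import pandas as pd

# Hypothetical toy data: M5A appears twice with different neighbourhoods
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Queen's Park"],
})

# Group on the two key columns and join the neighbourhood names with commas;
# sort=False keeps the groups in order of first appearance
combined = (
    toy.groupby(["PostalCode", "Borough"], sort=False, as_index=False)
       .agg({"Neighbourhood": ",".join})
)
print(combined)

# After combining, each postal code should appear exactly once
print(combined["PostalCode"].is_unique)
```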