# Scraping the neighborhoods in Toronto

## 1. Import the required libraries
 1. BeautifulSoup for parsing the HTML
 2. Requests for fetching the web page that BeautifulSoup will parse
 3. Pandas for building the final data frame

In [1]:
# 1. Import the required libraries

from bs4 import BeautifulSoup
import requests
import pandas as pd

## 2. Scraping the web page

### 1. Scraping the raw text

In [3]:
# 2. Scrape the Wikipedia page

# Get the page source & save to source
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

# Scrape the page
soup = BeautifulSoup(source, 'lxml')
table = soup.find_all('table')[0] # Only grab the 1st table

### 2. Extract the table

The required __Postal Code__, __Borough__, and __Neighborhood__ values all sit inside ```<tr>...</tr>``` tags in the __first table__ on the page. The data frame can therefore be built by iterating over these tags and parsing the cells of each row.
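
As a minimal sketch of this row-by-row parsing, the snippet below runs the same idea on a small inline HTML fragment instead of the live page. The fragment is hypothetical and only mimics the assumed three-column layout; the built-in `html.parser` is used here so that the sketch does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment that mimics the assumed layout of the table
html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M7A</td><td>Queen's Park</td><td>Not assigned</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find_all("table")[0]  # first table only

rows = []
for tr in table.find_all("tr"):
    tds = tr.find_all("td")
    if len(tds) != 3:  # the header row has <th> cells and no <td> cells
        continue
    postal_code, borough, neighborhood = (td.get_text(strip=True) for td in tds)
    if borough == "Not assigned":  # skip rows without a borough
        continue
    if neighborhood == "Not assigned":  # fall back to the borough name
        neighborhood = borough
    rows.append((postal_code, borough, neighborhood))

print(rows)
```

Skipping rows by their `<td>` count, rather than by position, is one way to avoid hard-coding how many header rows the table has.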

In [4]:
# Build the data frame
column_names = ['PostalCode', 'Borough', 'Neighborhood']
rows = []

# Iterate over the <tr> tags while filtering out the "Not assigned" rows
for tr in table.find_all('tr')[2:]:
    tds = tr.find_all('td')

    postalcode = tds[0].text
    borough = tds[1].text
    neighborhood = tds[2].text.strip('\n') # Strip the newline characters

    # Ignore the "Not assigned" borough
    if borough != 'Not assigned':
        # A "Not assigned" neighborhood takes the name of its borough
        if neighborhood == 'Not assigned':
            neighborhood = borough
        rows.append({
            'PostalCode': postalcode,
            'Borough': borough,
            'Neighborhood': neighborhood})

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
neigh_tor_df = pd.DataFrame(rows, columns = column_names)

print(neigh_tor_df.shape[0]) # entire size
print("Number of unique post codes: ", len(neigh_tor_df['PostalCode'].unique())) # Unique post codes

212
Number of unique post codes:  103


### 3. Combine the neighborhoods for the same postal codes

In [5]:
# Combine the neighborhoods for the same postal code
neighborhood_aggregator = lambda a: ", ".join(a)
df = neigh_tor_df.groupby('PostalCode').agg({'Borough': 'first', 'Neighborhood': neighborhood_aggregator}).reset_index()

print("Number of unique post codes: ", len(neigh_tor_df['PostalCode'].unique())) # Unique post codes
print("Number of post codes in cleaned data frame: ", df.shape[0])

df

Number of unique post codes:  103
Number of post codes in cleaned data frame:  103


| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
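
To see what the aggregation does in isolation, here is a minimal sketch on a three-row toy frame. The toy data and the names `toy` and `combined` are made up for illustration; only the `groupby`/`agg` pattern mirrors the cell above.

```python
import pandas as pd

# Toy input: one postal code appears twice, one appears once
toy = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge", "Malvern", "Woburn"],
})

# Keep the first borough per postal code and join the neighborhood names
combined = (
    toy.groupby("PostalCode")
       .agg({"Borough": "first", "Neighborhood": ", ".join})
       .reset_index()
)

print(combined)
```

Passing `", ".join` directly behaves like the lambda used above; it relies on every neighborhood value being a string.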
