## Install required packages
<hr>
First, let's install the packages needed to scrape the Wikipedia page.

In [1]:
#install beautifulsoup4
!pip install beautifulsoup4

Collecting beautifulsoup4
  Downloading https://files.pythonhosted.org/packages/3b/c8/a55eb6ea11cd7e5ac4bacdf92bac4693b90d3ba79268be16527555e186f0/beautifulsoup4-4.8.1-py3-none-any.whl (101kB)
Collecting soupsieve>=1.2
  Downloading https://files.pythonhosted.org/packages/81/94/03c0f04471fc245d08d0a99f7946ac228ca98da4fa75796c507f61e688c2/soupsieve-1.9.5-py2.py3-none-any.whl
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.8.1 soupsieve-1.9.5


In [2]:
#install lxml
!pip install lxml



In [3]:
#install html5lib
!pip install html5lib

Collecting html5lib
  Downloading https://files.pythonhosted.org/packages/a5/62/bbd2be0e7943ec8504b517e62bab011b4946e1258842bc159e5dfde15b96/html5lib-1.0.1-py2.py3-none-any.whl (117kB)
Installing collected packages: html5lib
Successfully installed html5lib-1.0.1


In [4]:
#install requests
!pip install requests



## Web scraping and datafile creation
<hr>

Now we scrape the Wikipedia page to get the postal code table, then write the extracted data to a CSV file called `Toronto_neighborhood.csv`. Rows where the Borough is not assigned are left out of the file.

In [5]:
from bs4 import BeautifulSoup
import requests

In [64]:
#get table source code
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')
#print(soup.prettify())

In [33]:
#parse table and create csv file 
import csv
table = soup.find('table', class_='wikitable sortable')
table_body = table.find('tbody')
with open('Toronto_neighborhood.csv', 'w', newline='') as csv_file:  # newline='' avoids blank rows on Windows
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['PostalCode', 'Borough', 'Neighborhood'])

    rows = table_body.find_all('tr')
    for row in rows[1:]:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        if cols[1] != 'Not assigned':  # keep only rows whose Borough is assigned
            csv_writer.writerow(cols)
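
The filtering rule in the cell above can be tried without a network connection. The sketch below is an illustration only: `write_assigned_rows` is a hypothetical helper, and `sample_rows` is a hand-typed stand-in for the parsed table cells, not the scraped data. It also skips rows with fewer than three cells, which is an added safeguard that the original loop does not have.

```python
import csv
import io

def write_assigned_rows(rows, out_file):
    """Write a header plus every row whose Borough is assigned.

    `rows` is an iterable of [PostalCode, Borough, Neighborhood] lists.
    Returns the number of data rows written.
    """
    writer = csv.writer(out_file)
    writer.writerow(['PostalCode', 'Borough', 'Neighborhood'])
    written = 0
    for cols in rows:
        # Assumption: malformed rows (fewer than 3 cells) are skipped
        if len(cols) < 3:
            continue
        if cols[1] != 'Not assigned':
            writer.writerow(cols)
            written += 1
    return written

# Hand-typed sample rows standing in for the parsed table cells
sample_rows = [
    ['M1A', 'Not assigned', 'Not assigned'],
    ['M3A', 'North York', 'Parkwoods'],
    ['M7A', "Queen's Park", 'Not assigned'],
    [],
]

buffer = io.StringIO(newline='')
count = write_assigned_rows(sample_rows, buffer)
print(count)
print(buffer.getvalue())
```

Note that a row is kept when only its Neighborhood is unassigned, as with `M7A` here; that case is handled later in the preprocessing section.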

## CSV file preprocessing
<hr>
In this section we combine the Neighborhoods that share a Postal code into a single row. Then, any Neighborhood that is not assigned is set to the name of its Borough.

In [35]:
import pandas as pd
toronto_df = pd.read_csv('Toronto_neighborhood.csv')

In [44]:
toronto_df.shape

(210, 3)

In [45]:
# create a table that contains combined Neighborhood
combined_df = toronto_df.groupby(['PostalCode'])['Neighborhood'].apply(lambda x: ','.join(x)).reset_index()

In [51]:
# Merge the two tables to obtain the combined table
toronto_df = pd.merge(toronto_df[['PostalCode', 'Borough']].drop_duplicates(), combined_df, on='PostalCode', how='inner')
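
The group-then-merge steps above could also be written as a single `groupby` over both key columns. The sketch below is an alternative, not the method used in this notebook, and it assumes each postal code belongs to exactly one borough. The `sample` frame is typed in by hand to mirror the `M5A` and `M6A` entries.

```python
import pandas as pd

# Hand-typed sample: M6A appears twice, once per neighborhood
sample = pd.DataFrame({
    'PostalCode': ['M5A', 'M6A', 'M6A'],
    'Borough': ['Downtown Toronto', 'North York', 'North York'],
    'Neighborhood': ['Harbourfront', 'Lawrence Heights', 'Lawrence Manor'],
})

# Group on both keys so Borough survives the aggregation,
# then join the neighborhood names of each group with commas.
combined = (
    sample.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
    .agg(','.join)
    .reset_index()
)
print(combined)
```

Passing `sort=False` keeps the postal codes in their order of first appearance instead of sorting them alphabetically.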

In [54]:
toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Not assigned


In [78]:
#Replace Not assigned Neighborhood with Borough
toronto_df = toronto_df.apply(lambda x: x.replace(x['Neighborhood'],x['Borough'] if (x['Neighborhood']=='Not assigned') else x['Neighborhood']) , axis=1)
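
The row-wise `apply` above works, but the same replacement can be expressed with a boolean mask, which is usually easier to read and avoids calling a Python function per row. This is a sketch on a hand-typed two-row `sample` frame, offered as an alternative rather than as the notebook's own code.

```python
import pandas as pd

# Hand-typed sample: one assigned and one unassigned neighborhood
sample = pd.DataFrame({
    'PostalCode': ['M6A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighborhood': ['Lawrence Heights,Lawrence Manor', 'Not assigned'],
})

# Select the rows whose Neighborhood is unassigned ...
unassigned = sample['Neighborhood'] == 'Not assigned'
# ... and copy the Borough name into Neighborhood for those rows only
sample.loc[unassigned, 'Neighborhood'] = sample.loc[unassigned, 'Borough']
print(sample)
```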

In [79]:
toronto_df.shape

(103, 5)

## Add latitude and longitude data
<hr>

In [73]:
#read csv file for Geospatial coordinates
geo_coord = pd.read_csv('Geospatial_Coordinates.csv')

In [86]:
geo_coord

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [87]:
#Merge toronto_df and geo_coord dataframe. Then delete 'Postal Code' column
toronto_df = pd.merge(toronto_df, geo_coord, how='left', left_on='PostalCode', right_on='Postal Code')
toronto_df.drop('Postal Code', axis=1, inplace=True)
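
Because this is a left join, any postal code missing from the coordinates file would silently get `NaN` for Latitude and Longitude. The sketch below shows one way to check for that using the `indicator=True` option of `pd.merge`. The two frames are hand-typed samples, and `'M0Z'` is a made-up postal code included only to produce an unmatched row.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    # 'M0Z' is a made-up code with no matching coordinates
    'PostalCode': ['M3A', 'M4A', 'M0Z'],
    'Borough': ['North York', 'North York', 'Example Borough'],
})
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# indicator=True adds a '_merge' column saying where each row came from
merged = pd.merge(
    neighborhoods, coords,
    how='left', left_on='PostalCode', right_on='Postal Code',
    indicator=True,
)

# Rows marked 'left_only' had no match in the coordinates table
missing = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print(missing)

# Drop the duplicate key column and the helper column
merged = merged.drop(columns=['Postal Code', '_merge'])
```

An empty `missing` list would mean every postal code received coordinates.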

In [88]:
toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
3,M6A,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
