# The objective of this notebook is to extract the names and postal codes of Toronto's neighborhoods, obtain their longitude and latitude, and use these coordinates with Foursquare's API to find the venues in each neighborhood, so that the neighborhoods can be clustered according to their most common venues

## The names and postal codes of Toronto's neighborhoods can be found on this Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

#### The first step is to install the packages that will be used to parse the HTML code of the Wikipedia page

In [1]:
pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install requests

Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install lxml

Note: you may need to restart the kernel to use updated packages.


#### After importing the BeautifulSoup and requests packages, we can use them to fetch and parse the Wikipedia page and extract the necessary information

In [2]:
from bs4 import BeautifulSoup
import requests

In [3]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

In [4]:
soup = BeautifulSoup(source, 'lxml')

#### By exploring the webpage, we notice that the table containing the information of interest is the first `table` element in the HTML code

#### This section cleans the obtained text to extract the important information into a DataFrame as required

In [5]:
# Split the table text on newlines and remove the empty strings
data = soup.find('table').text.split('\n')
data = [d for d in data if d != '']
data[:10]

['Postcode',
 'Borough',
 'Neighbourhood',
 'M1A',
 'Not assigned',
 'Not assigned',
 'M2A',
 'Not assigned',
 'Not assigned',
 'M3A']
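To make the flattening step concrete, here is a minimal sketch that runs on a hand-written string instead of the live page. The string `sample_text` is an assumption for illustration: it imitates what the text of the parsed table might look like (one cell per line, blank lines between rows), using cell values taken from the output above. The real page text may contain additional whitespace.

```python
# Hypothetical sample imitating the text of the parsed table:
# each cell sits on its own line and rows are separated by blank lines.
sample_text = (
    "\nPostcode\nBorough\nNeighbourhood\n"
    "\nM1A\nNot assigned\nNot assigned\n"
    "\nM3A\nNorth York\nParkwoods\n"
)

# Split on newlines and drop the empty strings left by the blank lines.
cells = [cell for cell in sample_text.split("\n") if cell != ""]

print(cells[:3])   # the three header cells
print(len(cells))  # three rows of three cells each
```

The result is one flat list, so the row structure has to be rebuilt afterwards by reading the cells in groups of three.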

In [6]:
header = data[:3] # Captures the header of the table
data = data[3:]
data[:10]

['M1A',
 'Not assigned',
 'Not assigned',
 'M2A',
 'Not assigned',
 'Not assigned',
 'M3A',
 'North York',
 'Parkwoods',
 'M4A']

In [7]:
header

['Postcode', 'Borough', 'Neighbourhood']

In [8]:
# Create a list of dictionaries, where each dictionary corresponds to one row of the table (and of the required DataFrame)
data_form = []
for i in range(0, len(data), 3):
    # Drop the rows whose Borough is not assigned
    if data[i+1] != 'Not assigned':
        # A Neighbourhood that is not assigned takes the name of its Borough
        if data[i+2] == 'Not assigned':
            data[i+2] = data[i+1]
        d = dict(zip(header, data[i:i+3]))
        data_form.append(d)
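The same row-building logic can be sketched as a small function that leaves its input unchanged and checks that the flat list really splits into complete rows. This is only an illustrative variant, not the notebook's own code: `rows_from_flat` is a hypothetical helper, it assumes exactly three columns, and the sample cells below are made up to exercise both rules.

```python
def rows_from_flat(cells, header):
    """Group a flat list of cells into row dictionaries keyed by header.

    Assumes three columns: postcode, borough, neighbourhood.
    """
    width = len(header)
    if len(cells) % width != 0:
        raise ValueError("cell count is not a multiple of the column count")
    rows = []
    for start in range(0, len(cells), width):
        postcode, borough, neighbourhood = cells[start:start + width]
        # Rows without an assigned borough are skipped entirely.
        if borough == "Not assigned":
            continue
        # An unassigned neighbourhood takes the name of its borough.
        if neighbourhood == "Not assigned":
            neighbourhood = borough
        rows.append(dict(zip(header, (postcode, borough, neighbourhood))))
    return rows


header = ["Postcode", "Borough", "Neighbourhood"]
# Hypothetical sample cells covering the three cases.
cells = [
    "M1A", "Not assigned", "Not assigned",
    "M3A", "North York", "Parkwoods",
    "M7A", "Queen's Park", "Not assigned",
]
rows = rows_from_flat(cells, header)
print(rows)
```

Raising an error on an incomplete row is a design choice: it surfaces a change in the page layout immediately instead of failing later with an index error.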

In [9]:
# Create a DataFrame whose column titles match the requirement
import pandas as pd

data_df = pd.DataFrame(data_form)
data_df.rename(columns = {'Neighbourhood':'Neighborhood', 'Postcode':'PostalCode'}, inplace=True)
data_df = data_df[['PostalCode', 'Borough', 'Neighborhood']]
data_df.head(15)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
| 5 | M6A | North York | Lawrence Manor |
| 6 | M7A | Queen's Park | Queen's Park |
| 7 | M9A | Etobicoke | Islington Avenue |
| 8 | M1B | Scarborough | Rouge |
| 9 | M1B | Scarborough | Malvern |


### An important assumption here is that the postal codes are ordered so that rows sharing the same postal code are adjacent
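The merging step that follows compares each row only with the row directly after it, so what it relies on is that rows with the same postal code sit next to each other. A minimal sketch of how that could be checked before merging is shown here; `codes_are_contiguous` is a hypothetical helper, not part of pandas or of the original notebook.

```python
def codes_are_contiguous(codes):
    """Return True if every postal code occupies one unbroken run of rows."""
    seen = set()
    previous = None
    for code in codes:
        if code != previous:
            # A code that starts a new run must not have appeared before.
            if code in seen:
                return False
            seen.add(code)
        previous = code
    return True


print(codes_are_contiguous(["M3A", "M4A", "M5A", "M5A", "M6A"]))
print(codes_are_contiguous(["M5A", "M6A", "M5A"]))
```

In the notebook it could be called as `codes_are_contiguous(data_df['PostalCode'])`; if it returned `False`, the adjacent-row comparison would miss some repeated postal codes.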

In [10]:
# Get rid of repeated postal codes by merging their Neighborhoods into a single row, as required
repeat_ind = []
for i in range(0,len(data_df)-1):
    if data_df.iloc[i,0] == data_df.iloc[i+1,0]:
        data_df.iloc[i+1, 2] = '{}, {}'.format(data_df.iloc[i+1,2],data_df.iloc[i,2])
        repeat_ind.append(i)
data_df.drop(repeat_ind, inplace=True)

In [11]:
# Fix the numbering of the rows after dropping the repeated postal codes
data_df.reset_index(drop=True, inplace=True)
data_df.head(15)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Queen's Park | Queen's Park |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |
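As a possible alternative to the row-by-row loop, the same merge can be sketched with a pandas `groupby`, which does not depend on repeated postal codes being adjacent. This is a swapped-in technique rather than the notebook's method, and the small frame `df` below is a hypothetical stand-in built from values shown in the earlier output.

```python
import pandas as pd

# Small hypothetical frame in the same shape as data_df before merging.
df = pd.DataFrame({
    "PostalCode": ["M3A", "M5A", "M5A", "M6A", "M6A"],
    "Borough": ["North York", "Downtown Toronto", "Downtown Toronto",
                "North York", "North York"],
    "Neighborhood": ["Parkwoods", "Harbourfront", "Regent Park",
                     "Lawrence Heights", "Lawrence Manor"],
})

# Group rows by postal code and borough, then join the neighborhood names.
# sort=False keeps the groups in order of first appearance.
merged = (
    df.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
      .agg(", ".join)
      .reset_index()
)
print(merged)
```

One visible difference: this version lists the neighborhoods in their original order (for example "Harbourfront, Regent Park"), whereas the loop above produces them in reverse.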


#### Display the shape of the DataFrame

In [12]:
data_df.shape

(103, 3)