# Segmenting and Clustering Toronto Neighborhoods

## Creating the Dataframe

To create the dataframe, we first scrape the source code of this Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [1]:
#Install the relevant packages:
!python -m pip install beautifulsoup4
!python -m pip install requests
!python -m pip install lxml

#Import the relevant packages:
import pandas as pd
from bs4 import BeautifulSoup
import requests




In [2]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text #scrapes the source code
soup = BeautifulSoup(source, 'lxml') #reads the source code

Now that we have the source code, we can isolate the source code of the table that we are interested in.

In [3]:
table = soup.find('table') #isolates the source code of the table

In [4]:
body = table.find_all('tr') #converts source code of table into a list of source code of each row

With the table's source code, we can begin formatting the data into something more workable.

In [5]:
t_headings = [] #creates empty list to be populated with table headings
for th in body[0].find_all('th'):
    t_headings.append(th.text.replace('\n', ' ').strip()) #populates headings list with table headings
    
table_data = [] #creates empty list to be populated with table data
for tr in body[1:]: #skips the heading row and reuses the list of rows built in the previous cell
    t_row = {} #creates empty dictionary to be populated with each row of data
    for td, th in zip(tr.find_all('td'), t_headings):
        t_row[th] = td.text.replace('\n', ' ').strip() #populates dictionary with data
    table_data.append(t_row) #populates data list with each row of data
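
The row-by-row parsing above can be tried without a network connection. The following is a minimal sketch, not the notebook's exact code: `SAMPLE_HTML` and `parse_table` are hypothetical names introduced here, the sample rows are stand-ins for the live page, and the standard-library `html.parser` is used in place of `lxml` so that only `beautifulsoup4` needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table (assumption: same three headings).
SAMPLE_HTML = """
<table>
  <tr><th>Postal Code
</th><th>Borough
</th><th>Neighbourhood
</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""


def parse_table(html):
    """Return one dict per data row, keyed by the text of the heading cells."""
    soup = BeautifulSoup(html, "html.parser")  # the notebook itself uses 'lxml'
    rows = soup.find("table").find_all("tr")
    # get_text(strip=True) trims the trailing newline inside each cell
    headings = [th.get_text(" ", strip=True) for th in rows[0].find_all("th")]
    return [
        {th: td.get_text(" ", strip=True) for th, td in zip(headings, tr.find_all("td"))}
        for tr in rows[1:]  # skip the heading row
    ]


rows = parse_table(SAMPLE_HTML)
print(rows[1])
```

Keying each row by its heading text, rather than by position, means the resulting list of dictionaries can be handed to `pd.DataFrame` directly and the column names come along for free.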

In [6]:
nb_list = pd.DataFrame(table_data) #converts scraped data into Pandas dataframe

Now that we have a Pandas dataframe, we can begin formatting and cleaning the data as per the parameters of the assignment.

In [7]:
nb_list = nb_list[['Postal Code', 'Borough', 'Neighbourhood']] #rearrange the columns

In [8]:
nb_list = nb_list[nb_list.Borough != 'Not assigned'].reset_index(drop = True) #drops all postal codes whose boroughs are not assigned
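
To see why the filter is followed by `reset_index(drop = True)`, here is a small sketch on a made-up frame. The name `sample` and its four rows are assumptions for illustration only; the real frame comes from the scraped table.

```python
import pandas as pd

# Hypothetical sample rows (assumption: same column names as the scraped table).
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M2A", "M4A"],
    "Borough": ["Not assigned", "North York", "Not assigned", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned", "Victoria Village"],
})

# Boolean mask keeps only rows whose borough is assigned.
kept = sample[sample["Borough"] != "Not assigned"]
print(kept.index.tolist())  # [1, 3]: the original row labels survive the filter

# drop=True discards the old labels instead of adding them as a new column.
kept = kept.reset_index(drop=True)
print(kept.index.tolist())  # [0, 1]: labels renumbered from zero
```

Without the reset, later positional displays such as `head()` would show gaps in the index wherever an unassigned row was removed.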

In [9]:
nb_list.shape #get the dimensions of the dataframe

(103, 3)

In [10]:
nb_list['Postal Code'].unique().shape #checks for duplicate postal codes

(103,)

Because the number of unique postal codes equals the number of rows, there are no duplicate postal codes in the dataframe, so no further cleaning is needed.

In [11]:
nb_list[nb_list.Neighbourhood == 'Not assigned'].shape #checks for unassigned neighbourhoods

(0, 3)

Because no neighbourhoods are unassigned, no further cleaning is needed.
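
The two checks above could also be bundled into one helper that reports counts instead of shapes. This is only a sketch under stated assumptions: `count_problems` is a hypothetical function name, and the small frames below are invented data rather than the scraped table.

```python
import pandas as pd


def count_problems(df):
    """Count duplicate postal codes and unassigned neighbourhoods in df."""
    return {
        # duplicated() marks every repeat after the first occurrence
        "duplicate_postal_codes": int(df["Postal Code"].duplicated().sum()),
        "unassigned_neighbourhoods": int((df["Neighbourhood"] == "Not assigned").sum()),
    }


# Hypothetical clean frame: both counts should be zero.
clean = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Victoria Village"],
})

# Hypothetical dirty frame: one repeated code and one unassigned neighbourhood.
dirty = pd.DataFrame({
    "Postal Code": ["M3A", "M3A", "M7A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Parkwoods", "Not assigned"],
})

clean_report = count_problems(clean)
dirty_report = count_problems(dirty)
print(clean_report)
print(dirty_report)
```

A count of zero for both keys corresponds to the `(103,)` and `(0, 3)` results shown in the cells above.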

All data has been collected and formatted as per the parameters of the assignment, so we will display the first twelve rows of the dataframe as well as its dimensions.

In [12]:
nb_list.head(12)

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |


In [13]:
nb_list.shape

(103, 3)