## Segmenting and Clustering Neighborhoods in Toronto #2

In [1]:
# importing the library and pulling the Wikipedia article
import requests
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

In [2]:
# importing the library for scraping the website content
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'lxml')
# checking the content
#print(soup.prettify())

We will be exploring the HTML tags and extracting the table data. First, we have to locate the right table.

In [3]:
# Using BeautifulSoup to obtain the data that is in the table of postal codes
My_table = soup.find('table',{'class':'wikitable sortable'})
#My_table


As we can see, the tags we need are "tr" and "td". Let's start extracting the table data. Each table column will be stored in a separate list.

In [4]:
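# one list per table column
A=[]  # postal codes
B=[]  # boroughs
C=[]  # neighbourhoods

for row in My_table.find_all('tr'):
    cells=row.find_all('td')
    # header rows use "th" cells, so only data rows have three "td" cells
    if len(cells)==3:
        # strip() drops the trailing newline without assuming its length
        A.append(cells[0].get_text().strip())
        B.append(cells[1].get_text().strip())
        C.append(cells[2].get_text().strip())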
#C
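
To see how the row and cell extraction behaves without fetching the live page, here is a minimal sketch on a small hand-written table. The HTML string and the names `sample_html`, `sample_table`, and `rows` are illustrative assumptions, not the Wikipedia markup, and `html.parser` is used only to avoid the `lxml` dependency.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the Wikipedia table (illustrative data only).
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", {"class": "wikitable sortable"})

rows = []
for tr in sample_table.find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) == 3:  # the header row has <th> cells, so it is skipped
        # get_text().strip() removes the surrounding whitespace of each cell
        rows.append([cell.get_text().strip() for cell in cells])

print(rows)
```

Collecting whole rows instead of three parallel lists is only a convenience for the sketch; the per-column lists used in the notebook work the same way.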

Next, we will transform the data into a pandas dataframe with three columns. 
After that, we'll clean up the data by removing the rows whose borough is "Not assigned".

In [5]:
import pandas as pd
df = pd.DataFrame()
df['PostalCodes'] = A
df['Borough'] = B
df['Neighbourhood'] = C

df = df[~df['Borough'].isin(['Not assigned'])]
df.reset_index(drop = True, inplace = True)

df.head()

Unnamed: 0,PostalCodes,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
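
The cleanup step can also be tried in isolation. The sketch below assumes a tiny hand-made frame (the `sample` and `cleaned` names and the "M1A" row are illustrative) and uses a plain `!=` comparison, which is equivalent to `~isin([...])` when only one value is excluded.

```python
import pandas as pd

# Illustrative rows; the "Not assigned" entry stands in for the rows being dropped.
sample = pd.DataFrame({
    "PostalCodes": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows with an assigned borough, then renumber the index from 0.
cleaned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(cleaned)
```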


Next, we'll check the dataframe columns and data:

In [6]:
print(df.columns)
df.head(10)

Index(['PostalCodes', 'Borough', 'Neighbourhood'], dtype='object')


Unnamed: 0,PostalCodes,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


Finally, let's check the dataframe dimensions.

In [7]:
df.shape

(103, 3)

Now that we have a dataframe with the postal code, borough, and neighborhood name of each neighborhood, we need the latitude and longitude of each one in order to use the Foursquare location data. We will take the coordinates from the provided csv file.

In [8]:
csv_path = "Downloads/Geospatial_Coordinates.csv"
dfGeo = pd.read_csv(csv_path)
# rename column to align between two dataframes
dfGeo.rename(columns = {'Postal Code':'PostalCodes'}, inplace = True) 
dfGeo.head()

Unnamed: 0,PostalCodes,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
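
The load-and-rename step can be checked without the file on disk. This sketch reads three of the rows shown above from an in-memory string; the names `csv_text` and `sample_geo` are illustrative, and the header line is assumed to match the file's header.

```python
import io
import pandas as pd

# Three rows copied from the output above, under the assumed file header.
csv_text = """Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
"""

sample_geo = pd.read_csv(io.StringIO(csv_text))
# Rename the key column so that it matches the scraped dataframe.
sample_geo = sample_geo.rename(columns={"Postal Code": "PostalCodes"})
print(sample_geo.columns.tolist())
```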


We merge the two frames on the "PostalCodes" column to create a single table. The merge keeps the row order of the first frame.

In [9]:
df = pd.merge(df, dfGeo, on="PostalCodes")
print(df.columns)
df.head()

Index(['PostalCodes', 'Borough', 'Neighbourhood', 'Latitude', 'Longitude'], dtype='object')


Unnamed: 0,PostalCodes,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
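
An inner merge silently drops postal codes that have no match in the coordinates table. One way to check for that, sketched here on hand-made frames, is a left merge with `indicator=True`; the names `left`, `right`, `checked`, and `unmatched` and the code "M0X" are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({
    "PostalCodes": ["M3A", "M4A", "M0X"],  # "M0X" is a made-up code with no match
    "Borough": ["North York", "North York", "Hypothetical"],
})
right = pd.DataFrame({
    "PostalCodes": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# how="left" keeps every row of `left`; the _merge column records where each row matched.
checked = pd.merge(left, right, on="PostalCodes", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCodes"].tolist()
print(unmatched)
```

An empty `unmatched` list would mean that every postal code found its coordinates.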


Since the Google geocoder didn't work well, we try another geocoder, pgeocode:

In [15]:
#!pip install geopy
#!pip install geocoder
#import geopy # 
#import geocoder
#from geopy.geocoders import Nominatim

#!pip install pgeocode
import pgeocode
nomi = pgeocode.Nominatim('ca')

geoA=[]
geoB=[]
geoC=[]

print("Start!")

# enumerate supplies the zero-based counter shown in the progress message
for i, postal_code in enumerate(df['PostalCodes']):
    print("Processing postal code {} out of 103 -> {}".format(i, postal_code))
    # look up the coordinates for this postal code
    location = nomi.query_postal_code(postal_code)
    geoA.append(postal_code)
    geoB.append(location.latitude)
    geoC.append(location.longitude)
    print(postal_code+", "+str(location.latitude)+", "+str(location.longitude))

print("Done!")

Start!
Processing postal code 0 out of 103 -> M3A
M3A, 43.7545, -79.33
Processing postal code 1 out of 103 -> M4A
M4A, 43.7276, -79.3148
Processing postal code 2 out of 103 -> M5A
M5A, 43.6555, -79.3626
Processing postal code 3 out of 103 -> M6A
M6A, 43.7223, -79.4504
Processing postal code 4 out of 103 -> M7A
M7A, 43.6641, -79.3889
Processing postal code 5 out of 103 -> M9A
M9A, 43.6662, -79.5282
Processing postal code 6 out of 103 -> M1B
M1B, 43.8113, -79.193
Processing postal code 7 out of 103 -> M3B
M3B, 43.745, -79.359
Processing postal code 8 out of 103 -> M4B
M4B, 43.7063, -79.3094
Processing postal code 9 out of 103 -> M5B
M5B, 43.6572, -79.3783
Processing postal code 10 out of 103 -> M6B
M6B, 43.7081, -79.4479
Processing postal code 11 out of 103 -> M9B
M9B, 43.6505, -79.5517
Processing postal code 12 out of 103 -> M1C
M1C, 43.7878, -79.1564
Processing postal code 13 out of 103 -> M3C
M3C, 43.7334, -79.3329
Processing postal code 14 out of 103 -> M4C
M4C, 43.6913, -79.3116
...

Next, we merge the lists into a dataframe:

In [17]:
import pandas as pd
dfGeoAlt = pd.DataFrame()

dfGeoAlt['PostalCodes'] = geoA
dfGeoAlt['Latitude'] = geoB
dfGeoAlt['Longitude'] = geoC
dfGeoAlt.head()

Unnamed: 0,PostalCodes,Latitude,Longitude
0,M3A,43.7545,-79.33
1,M4A,43.7276,-79.3148
2,M5A,43.6555,-79.3626
3,M6A,43.7223,-79.4504
4,M7A,43.6641,-79.3889


In [18]:
dfGeoAlt.shape

(103, 3)
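
Since both sources now give coordinates for the same postal codes, one way to compare them is the great-circle distance between the two points for each code. The sketch below applies the haversine formula to the first five rows shown above; the function name `haversine_km`, the `pairs` dictionary, and the Earth radius of 6371 km are choices made here for illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# (csv latitude, csv longitude, pgeocode latitude, pgeocode longitude),
# copied from the outputs shown earlier in this notebook
pairs = {
    "M3A": (43.753259, -79.329656, 43.7545, -79.33),
    "M4A": (43.725882, -79.315572, 43.7276, -79.3148),
    "M5A": (43.65426, -79.360636, 43.6555, -79.3626),
    "M6A": (43.718518, -79.464763, 43.7223, -79.4504),
    "M7A": (43.662301, -79.389494, 43.6641, -79.3889),
}

distances = {code: haversine_km(*coords) for code, coords in pairs.items()}
for code, d in distances.items():
    print(f"{code}: {d:.2f} km")
```

On these five rows the two sources differ by less than two kilometres, which says nothing about the remaining postal codes.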

We were therefore able to obtain the geographical coordinates of the neighborhoods from both the csv file and the pgeocode package, and to build a dataframe from each. This concludes notebook 2.