## Part I

This notebook scrapes the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to extract the table of postal codes and load it into a pandas DataFrame.

In [1]:
import requests
import bs4
import pandas as pd
import re

In [2]:
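url=r'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
res=requests.get(url)
res.raise_for_status()  # stop early if the page could not be fetched
# name the parser explicitly; without it bs4 guesses one and emits a warning
Ca_PC=bs4.BeautifulSoup(res.text,'html.parser')
elems=Ca_PC.select('table')
# flatten the first table into its non-empty text lines
items=[item for item in elems[0].getText().split('\n') if item != '']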

In [3]:
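codes=[]
neig=[]
# each table entry is three consecutive items: postal code, borough, neighborhood
for i in range(len(items)-2):  # stop two short so items[i+2] is always valid
    if re.match('M[0-9][A-Z]',items[i]) and items[i+1]!='Not assigned':
        codes.append(items[i])
        neig.append([items[i],(items[i+1]).replace(r'/',','),(items[i+2]).replace(r'/',',')])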
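
The matching logic can be checked offline on a hand-written token list. The sketch below is not the notebook's code: `sample_items` and `extract_rows` are hypothetical names, the sample only mimics the flattened table text, and it uses `re.fullmatch` rather than `re.match`, an assumed tightening so that only exact three-character codes qualify.

```python
import re

# Hypothetical sample mimicking the flattened table text:
# a header row, then code / borough / neighborhood triples.
sample_items = [
    'Postal Code', 'Borough', 'Neighborhood',
    'M1A', 'Not assigned', 'Not assigned',
    'M3A', 'North York', 'Parkwoods',
    'M5A', 'Downtown Toronto', 'Regent Park / Harbourfront',
]

def extract_rows(items):
    """Return [code, borough, neighborhood] rows, skipping unassigned codes."""
    rows = []
    for i in range(len(items) - 2):  # leave room for items[i + 1] and items[i + 2]
        is_code = re.fullmatch(r'M[0-9][A-Z]', items[i]) is not None
        if is_code and items[i + 1] != 'Not assigned':
            rows.append([items[i],
                         items[i + 1].replace('/', ','),
                         items[i + 2].replace('/', ',')])
    return rows

rows = extract_rows(sample_items)
print(rows)
```

Only the two assigned codes survive the filter, and the slash in the neighborhood string is replaced by a comma, as in the notebook's own loop.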

In [4]:
PostalCodes = pd.DataFrame(neig,columns=['Postal Code','Borough','Neighborhood'])
PostalCodes.head()

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park , Harbourfront |
| 3 | M6A | North York | Lawrence Manor , Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park , Ontario Provincial Government |


In [5]:
PostalCodes.shape

(103, 3)

## Part II

In [6]:
import geocoder

lat_lng_coords = None
i=0
while(lat_lng_coords is None and i<=50):
    g = geocoder.google('{}, Toronto, Ontario'.format('M5G'))
    lat_lng_coords = g.latlng
    i+=1
if lat_lng_coords is None:
    print('failed!')
else:
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]

Status code Unknown from https://maps.googleapis.com/maps/api/geocode/json: ERROR - HTTPSConnectionPool(host='maps.googleapis.com', port=443): Max retries exceeded with url: /maps/api/geocode/json?address=M5G%2C+Toronto%2C+Ontario&bounds=&components=&region=&language= (Caused by ProxyError('Cannot connect to proxy.', timeout('select timed out')))


failed!
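
A bounded retry like the loop above can be factored into a helper that also pauses between attempts. This is only a sketch: `retry_until_result`, `max_attempts`, `delay_seconds`, and the stand-in `fake_lookup` are hypothetical names, and no real geocoding service is called. The coordinates are the Toronto values printed later in this notebook, used here purely as placeholder data.

```python
import time

def retry_until_result(lookup, max_attempts=5, delay_seconds=0.0):
    """Call lookup() until it returns something other than None, or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        result = lookup()
        if result is not None:
            return result, attempt
        time.sleep(delay_seconds)  # brief pause before the next try
    return None, max_attempts

# Hypothetical stand-in for a geocoder call: fails twice, then succeeds.
responses = iter([None, None, [43.6534817, -79.3839347]])

def fake_lookup():
    return next(responses)

coords, attempts = retry_until_result(fake_lookup)
print(coords, attempts)
```

Returning the attempt count alongside the result makes it easy to log how often the service had to be retried.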


**Because the geocoder request fails (the connection to the Google Maps API is blocked by a proxy error), I load the coordinates from the CSV file instead.**

In [9]:
df=pd.read_csv('Geospatial_Coordinates.csv')
df.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [10]:
PostalCodes=pd.merge(PostalCodes,df,how='left',on='Postal Code')
PostalCodes.head()

| | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park , Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor , Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park , Ontario Provincial Government | 43.662301 | -79.389494 |


In [11]:
PostalCodes.shape

(103, 5)
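
A left merge keeps every row of the left frame, so the unchanged row count alone does not show whether each postal code actually found coordinates. One way to check, sketched on two small hypothetical frames (`left`, `right`, and the code `M9Z` are made up for illustration), is the `indicator` option of `pd.merge`:

```python
import pandas as pd

left = pd.DataFrame({'Postal Code': ['M3A', 'M4A', 'M9Z'],
                     'Borough': ['North York', 'North York', 'Example Borough']})
right = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                      'Latitude': [43.753259, 43.725882],
                      'Longitude': [-79.329656, -79.315572]})

# indicator=True adds a '_merge' column saying where each row came from
merged = pd.merge(left, right, how='left', on='Postal Code', indicator=True)

# rows marked 'left_only' had no match in the coordinates table
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
print(merged['Latitude'].isna().sum())
```

An empty `unmatched` list on the real data would confirm that every postal code received a latitude and longitude.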

## Part III

I cluster the postal codes based on how many neighborhoods each one contains and which borough it belongs to.

In [22]:
test_PostalCodes=PostalCodes.copy()

In [23]:
for i in range(len(test_PostalCodes)):
    test_PostalCodes.loc[i,'Num']=len(test_PostalCodes.iloc[i,2].split(','))
test_PostalCodes.head()

| | Postal Code | Borough | Neighborhood | Latitude | Longitude | Num |
|---|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 | 1.0 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 | 1.0 |
| 2 | M5A | Downtown Toronto | Regent Park , Harbourfront | 43.65426 | -79.360636 | 2.0 |
| 3 | M6A | North York | Lawrence Manor , Lawrence Heights | 43.718518 | -79.464763 | 2.0 |
| 4 | M7A | Downtown Toronto | Queen's Park , Ontario Provincial Government | 43.662301 | -79.389494 | 2.0 |
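
The same count can be computed without an explicit loop. A minimal sketch on a small hypothetical frame (`demo` is not one of the notebook's variables, though its values are copied from the rows above); note that it yields integers rather than the floats produced by assigning row by row through `.loc`:

```python
import pandas as pd

demo = pd.DataFrame({'Neighborhood': ['Parkwoods',
                                      'Regent Park , Harbourfront',
                                      "Queen's Park , Ontario Provincial Government"]})

# split each string on commas and take the length of the resulting list
demo['Num'] = demo['Neighborhood'].str.split(',').str.len()
print(demo['Num'].tolist())
```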


In [26]:
test_PostalCodes.groupby('Borough').count()

| Borough | Postal Code | Neighborhood | Latitude | Longitude | Num |
|---|---|---|---|---|---|
| Central Toronto | 9 | 9 | 9 | 9 | 9 |
| Downtown Toronto | 19 | 19 | 19 | 19 | 19 |
| East Toronto | 5 | 5 | 5 | 5 | 5 |
| East York | 5 | 5 | 5 | 5 | 5 |
| Etobicoke | 12 | 12 | 12 | 12 | 12 |
| Mississauga | 1 | 1 | 1 | 1 | 1 |
| North York | 24 | 24 | 24 | 24 | 24 |
| Scarborough | 17 | 17 | 17 | 17 | 17 |
| West Toronto | 6 | 6 | 6 | 6 | 6 |
| York | 5 | 5 | 5 | 5 | 5 |


In [29]:
# one hot encoding
PostalCodes_onehot = pd.get_dummies(test_PostalCodes[['Borough']], prefix="", prefix_sep="")

# add num and Postal Code column back to dataframe
PostalCodes_onehot['Num'] = test_PostalCodes['Num']
PostalCodes_onehot['Postal Code'] = test_PostalCodes['Postal Code']

# move the Postal Code column to the first position
fixed_columns = [PostalCodes_onehot.columns[-1]] + list(PostalCodes_onehot.columns[:-1])
PostalCodes_onehot = PostalCodes_onehot[fixed_columns]

PostalCodes_onehot.head()

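| | Postal Code | Central Toronto | Downtown Toronto | East Toronto | East York | Etobicoke | Mississauga | North York | Scarborough | West Toronto | York | Num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M3A | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1.0 |
| 1 | M4A | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1.0 |
| 2 | M5A | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.0 |
| 3 | M6A | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 2.0 |
| 4 | M7A | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.0 |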


In [30]:
PostalCodes_onehot.shape

(103, 12)

Run k-means to group the postal codes into 5 clusters.

In [32]:
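# scale every feature to the range [0, 1] with MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# drop the identifier column; the positional axis argument is no longer accepted by recent pandas
PostalCodes_onehot_clustering = PostalCodes_onehot.drop(columns='Postal Code')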
scaler = MinMaxScaler()
scaler.fit(PostalCodes_onehot_clustering)
trans_PostalCodes_onehot_clustering = scaler.transform(PostalCodes_onehot_clustering)
print(trans_PostalCodes_onehot_clustering)

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         1.         0.         ... 0.         0.         0.14285714]
 ...
 [0.         0.         1.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         1.        ]
 [0.         0.         0.         ... 0.         0.         0.57142857]]
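
`MinMaxScaler` rescales each column with `(x - min) / (max - min)`. The one-hot borough columns already contain only 0 and 1, so in practice only `Num` changes. The sketch below applies the formula by hand, assuming the neighborhood counts range from 1 to 8, which is consistent with the 0.14285714 printed above for a row with two neighborhoods. `min_max_scale` and `sample_counts` are hypothetical names, not part of scikit-learn.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

sample_counts = [1, 2, 5, 8]  # assumed range of neighborhood counts
scaled = min_max_scale(sample_counts)
print([round(s, 8) for s in scaled])
```

Under that assumption a count of 2 maps to 1/7 and a count of 5 maps to 4/7, matching the values in the last column of the printed array.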


In [35]:
from sklearn.cluster import KMeans 
# set number of clusters
kclusters = 5

# run k-means clustering
kmeans = KMeans(init="k-means++",n_clusters=kclusters, n_init=12).fit(trans_PostalCodes_onehot_clustering)

# check cluster labels generated for each row in the dataframe
k_means_labels = kmeans.labels_
k_means_labels

array([0, 0, 2, 0, 2, 1, 3, 0, 4, 2, 0, 1, 3, 0, 4, 2, 4, 1, 3, 4, 2, 4,
       3, 4, 2, 2, 3, 0, 0, 4, 2, 4, 3, 0, 0, 4, 2, 4, 3, 0, 0, 4, 2, 4,
       3, 0, 0, 4, 2, 0, 0, 3, 0, 0, 4, 0, 4, 0, 3, 0, 0, 4, 4, 4, 4, 3,
       0, 4, 4, 4, 1, 3, 0, 4, 4, 4, 4, 1, 3, 4, 2, 4, 3, 4, 2, 3, 4, 2,
       1, 1, 3, 2, 2, 1, 1, 3, 2, 2, 1, 2, 4, 1, 1])
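
Because k-means starts from randomly chosen centroids, the label numbers (and sometimes the groupings) can differ between runs. Fixing `random_state` makes a run repeatable. The sketch uses four made-up points and two clusters, not the notebook's data, and the seed value 0 is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

# made-up, well-separated points: two near the origin, two near (1, 1)
X = np.array([[0.0, 0.0], [0.0, 0.1], [1.0, 1.0], [1.0, 0.9]])

# two independent fits with the same seed
labels_a = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X).labels_
labels_b = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X).labels_

print(labels_a, labels_b)  # identical because the seed is the same
```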

Create a map of Toronto with clustered postal codes.

In [36]:
test_PostalCodes['group']=k_means_labels
test_PostalCodes.head()

| | Postal Code | Borough | Neighborhood | Latitude | Longitude | Num | group |
|---|---|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 | 1.0 | 0 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 | 1.0 | 0 |
| 2 | M5A | Downtown Toronto | Regent Park , Harbourfront | 43.65426 | -79.360636 | 2.0 | 2 |
| 3 | M6A | North York | Lawrence Manor , Lawrence Heights | 43.718518 | -79.464763 | 2.0 | 0 |
| 4 | M7A | Downtown Toronto | Queen's Park , Ontario Provincial Government | 43.662301 | -79.389494 | 2.0 | 2 |
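
To see what each cluster captures, it helps to count boroughs per group. A minimal sketch using only the five rows shown above (`rows` and `summary` are hypothetical names; on the full frame, a pandas `groupby` on `group` and `Borough` would serve the same purpose):

```python
from collections import Counter

# (group, borough) pairs copied from the first five rows shown above
rows = [(0, 'North York'), (0, 'North York'), (2, 'Downtown Toronto'),
        (0, 'North York'), (2, 'Downtown Toronto')]

# count how often each (group, borough) combination occurs
summary = Counter(rows)
for (group, borough), n in sorted(summary.items()):
    print('group {}: {} x {}'.format(group, borough, n))
```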


In [39]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

address = 'Toronto, CA'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [46]:
import folium # map rendering library

# create map of Toronto using latitude and longitude values
map_ = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
colors=['#000033','#009900','#ff0000','#9966cc','#ccffcc','#ff3300']
for lat, lng, PostalCode, group in zip(test_PostalCodes['Latitude'], test_PostalCodes['Longitude'], test_PostalCodes['Postal Code'], test_PostalCodes['group']):
    label = '{},({}, {})-group{}'.format(PostalCode,lat, lng,group)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=colors[group],
        fill=True,
        fill_color=colors[group],
        fill_opacity=0.7,
    ).add_to(map_)
    
map_
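
Indexing `colors[group]` raises an `IndexError` if the number of clusters ever exceeds the palette. A small hypothetical helper, `color_for`, wraps around instead (the palette is the one used above; `PALETTE` is an assumed name):

```python
PALETTE = ['#000033', '#009900', '#ff0000', '#9966cc', '#ccffcc', '#ff3300']

def color_for(group, palette=PALETTE):
    """Pick a color for a cluster label, cycling when labels outnumber colors."""
    return palette[int(group) % len(palette)]

print(color_for(2), color_for(7))
```

With such a helper, `color=color_for(group)` could replace `color=colors[group]` in the marker loop if `kclusters` is raised later.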