# Segmenting and Clustering Neighborhoods in Toronto

## Data Mining

Description of how to mine the data from the Wikipedia page: building the code to scrape the page and writing the data into a DataFrame.

### Scraping

* __Use the Notebook to build the code to scrape the following [Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M "List of postal codes of Canada: M").__

Loading the necessary libraries. Apart from <code>pandas</code>, both <code>requests</code> and <code>lxml</code> need installation.

In [647]:
import pandas as pd



try:
    import requests
except ModuleNotFoundError:
    ! conda install requests
finally:
    import requests
    
try:
    import lxml.html as lh
except ModuleNotFoundError:
    ! conda install lxml
finally:
    from lxml import html as lh

The code below allows us to get the data from the table from the [Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M "List of postal codes of Canada: M").

In [648]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'



#Create a handle, page, to handle the contents of the website.
page = requests.get(url)

#Store the contents of the website under doc.
doc = lh.fromstring(page.content)

#Parse data that are stored between <tr>..</tr> of HTML.
tr_elements = doc.xpath('//tr')

Check the length of the first 12 rows.

In [649]:
[len(T) for T in tr_elements[:12]]

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]

It looks like each of the first 12 rows has exactly 3 columns, which suggests that the data collected in <code>tr_elements</code> come from the table. Next, parse the first row as our header.
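
The check above looks only at the first 12 rows. Counting the lengths of every row would show whether some `<tr>` elements come from other tables on the page. A minimal sketch, where the hypothetical list `row_lengths` stands in for `[len(T) for T in tr_elements]`:

```python
from collections import Counter

# Hypothetical row lengths standing in for [len(T) for T in tr_elements];
# the real values depend on the page at scraping time.
row_lengths = [3, 3, 3, 3, 3, 2, 15, 15]

# Count how many rows have each length.
length_counts = Counter(row_lengths)
print(length_counts)

# Rows whose length differs from 3 do not belong to the postal code table.
foreign_rows = sum(n for length, n in length_counts.items() if length != 3)
print(foreign_rows)
```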

Making a scraper.

In [650]:
def preparator(tr_elements):
    col=[]
    #For each cell of the first row (the header), store its text and an empty list.
    for t in tr_elements[0]:
        name=t.text_content()
        col.append((name,[]))
    #Since our first row is the header, data is stored on the second row onwards.
    for T in tr_elements[1:]:
        #If row is not of size 3, the //tr data is not from our table.
        if len(T)!=3:
            break

        #Iterate through each element of the row; i is the index of our column.
        for i, t in enumerate(T.iterchildren()):
            data=t.text_content()
            #Leave the first column as text; in the other columns,
            #convert any numerical value to an integer.
            if i>0:
                try:
                    data=int(data)
                except ValueError:
                    pass
            #Append the data to the list of the i'th column.
            col[i][1].append(data)
    return col

In [651]:
col = preparator(tr_elements)

Scraping is complete.
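
The row-by-row loop in `preparator` can also be seen as a transposition of rows into columns. A minimal sketch, assuming the cell texts were already extracted into plain lists; the `rows` sample is hypothetical and only mimics the table layout:

```python
# Hypothetical cell texts: the first row is the header, the rest are data rows.
rows = [
    ["Postcode", "Borough", "Neighbourhood\n"],
    ["M1A", "Not assigned", "Not assigned\n"],
    ["M3A", "North York", "Parkwoods\n"],
]

header, *data_rows = rows

# zip(*data_rows) transposes the data rows into columns.
columns = {name: list(values) for name, values in zip(header, zip(*data_rows))}
print(columns["Borough"])
```

A dictionary of this shape can be passed to `pd.DataFrame` directly, like the `dictionary` built in the next section.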

### Data Cleaning

Forming the DataFrame from the scraped data. Examining the data to find excess symbols, missing data or other problems in the string data of the DataFrame.
Changing headers if necessary, etc.

* __The dataframe will consist of three columns: <span style="color:red">'PostalCode'</span>, <span style="color:red">'Borough'</span>, and <span style="color:red">'Neighborhood'</span>__

Examining number of columns and length of each column.

In [652]:
print(len(col), [len(C) for (title,C) in col])

3 [287, 287, 287]


Creating DataFrame.

In [653]:
dictionary={title:column for (title,column) in col}
df=pd.DataFrame(dictionary)

df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned\n
1,M2A,Not assigned,Not assigned\n
2,M3A,North York,Parkwoods\n
3,M4A,North York,Victoria Village\n
4,M5A,Downtown Toronto,Harbourfront\n
5,M6A,North York,Lawrence Heights\n
6,M6A,North York,Lawrence Manor\n
7,M7A,Downtown Toronto,Queen's Park\n
8,M8A,Not assigned,Not assigned\n
9,M9A,Queen's Park,Not assigned\n


In [654]:
df.tail()

Unnamed: 0,Postcode,Borough,Neighbourhood
282,M8Z,Etobicoke,Mimico NW\n
283,M8Z,Etobicoke,The Queensway West\n
284,M8Z,Etobicoke,Royal York South West\n
285,M8Z,Etobicoke,South of Bloor\n
286,M9Z,Not assigned,Not assigned\n


Renaming column <span style="color:red">'Postcode'</span> to <span style="color:red">'PostalCode'</span>.

In [655]:
df.rename(columns={'Postcode': 'PostalCode', 'Neighbourhood\n': 'Neighbourhood'}, inplace=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned\n
1,M2A,Not assigned,Not assigned\n
2,M3A,North York,Parkwoods\n
3,M4A,North York,Victoria Village\n
4,M5A,Downtown Toronto,Harbourfront\n


Check whether all names in the column <span style="color:red">'Neighbourhood'</span> have the same 'dirty' tail, such as <span style="color:red">'\n'</span>.

In [656]:
len(pd.unique(df['Neighbourhood'].apply(lambda s: s[-1:])))

1

Yes. Each of them has an identical "tin can tied to its leg". Now that we know this, we can apply a very simple rule to cut off this "tin can".

In [657]:
df['Neighbourhood'] = df.Neighbourhood.apply(lambda s: s[:-1])
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
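
Slicing off the last character works here because every value ends with the same character. A sketch of an alternative that does not depend on that, using the pandas string accessor `.str.rstrip`; the small `sample` frame is hypothetical:

```python
import pandas as pd

# Hypothetical sample; the second value has no trailing newline.
sample = pd.DataFrame({"Neighbourhood": ["Parkwoods\n", "Victoria Village"]})

# rstrip removes only trailing newline characters and leaves clean values alone.
sample["Neighbourhood"] = sample["Neighbourhood"].str.rstrip("\n")
print(sample["Neighbourhood"].tolist())
```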


Now the DataFrame consists of three columns: <span style="color:red">'PostalCode'</span>, <span style="color:red">'Borough'</span>, and <span style="color:red">'Neighbourhood'</span>.

* __Only process the cells that have an assigned <span style="color:red">'Borough'</span>. Ignore cells with a <span style="color:red">'Borough'</span> that is 'Not assigned'.__

Keep only the rows whose <span style="color:red">'Borough'</span> is assigned.

In [658]:
df = df[df['Borough'] != 'Not assigned']
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


Reset the index so that it starts from 0.

In [659]:
df.reset_index(inplace=True, drop=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor


* __If a cell has a <span style="color:red">'Borough'</span> but a 'Not assigned' <span style="color:red">'Neighbourhood'</span>, then the <span style="color:red">'Neighbourhood'</span> content will be the same as the <span style="color:red">'Borough'</span>.__ 

Checking it.

In [660]:
test = df[df['Neighbourhood'] == 'Not assigned'].copy() #Work on a copy to avoid modifying a slice of df.
test.loc[:, 'Neighbourhood'] = test.loc[:, 'Borough']
test.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
6,M9A,Queen's Park,Queen's Park


Fix it according to instructions.

In [661]:
df[df['Neighbourhood'] == 'Not assigned'] = test.loc[:, :] 
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
5,M7A,Downtown Toronto,Queen's Park
6,M9A,Queen's Park,Queen's Park
7,M1B,Scarborough,Rouge
8,M1B,Scarborough,Malvern
9,M3B,North York,Don Mills North
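
The check-then-assign pair above can also be written as one masked assignment. A minimal sketch on a hypothetical three-row frame, using `.loc` with a boolean mask:

```python
import pandas as pd

# Hypothetical rows mimicking the cleaned table.
sample = pd.DataFrame({
    "PostalCode": ["M7A", "M9A", "M1B"],
    "Borough": ["Downtown Toronto", "Queen's Park", "Scarborough"],
    "Neighbourhood": ["Queen's Park", "Not assigned", "Rouge"],
})

# Rows whose neighbourhood is missing take the borough name instead.
mask = sample["Neighbourhood"] == "Not assigned"
sample.loc[mask, "Neighbourhood"] = sample.loc[mask, "Borough"]
print(sample)
```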


* __More than one neighborhood can exist in one postal code area. These names will be combined into one row with the neighborhoods separated with a comma.__

Converting the content of the <span style="color:red">'Neighbourhood'</span> column to <code><span style="color:green">list</span></code> objects, then grouping <span style="color:red">'Neighbourhood'</span> by <span style="color:red">'PostalCode'</span>, and finally applying <code>sum()</code> to join the grouped lists together.

In [662]:
df['Neighbourhood'] = df.Neighbourhood.apply(lambda s: [s])
pc = df[['PostalCode', 'Neighbourhood']]
pc = pc.groupby(['PostalCode']).sum()
pc.head()

Unnamed: 0_level_0,Neighbourhood
PostalCode,Unnamed: 1_level_1
M1B,"[Rouge, Malvern]"
M1C,"[Highland Creek, Rouge Hill, Port Union]"
M1E,"[Guildwood, Morningside, West Hill]"
M1G,[Woburn]
M1H,[Cedarbrae]


Releasing the <span style="color:red">'Neighbourhood'</span> content from the square brackets.

In [663]:
pc['Neighbourhood'] = pc.Neighbourhood.apply(lambda l: ', '.join(l))
pc.reset_index(inplace=True)
pc.head()

Unnamed: 0,PostalCode,Neighbourhood
0,M1B,"Rouge, Malvern"
1,M1C,"Highland Creek, Rouge Hill, Port Union"
2,M1E,"Guildwood, Morningside, West Hill"
3,M1G,Woburn
4,M1H,Cedarbrae


Check whether the DataFrames to be joined have the same shape.

In [664]:
df.shape == pc.shape

False

Since the <code>.shape</code> of the <code>df</code> and <code>pc</code> DataFrames is not the same, another DataFrame is needed to join them.

Synthesizing the DataFrame <code>clear_df</code> from scratch.

In [665]:
clear_df = pd.DataFrame()
clear_df['PostalCode'] = pd.unique(df.PostalCode)
clear_df['Borough'] = False
clear_df['Neighbourhood'] = False
clear_df = clear_df.set_index('PostalCode')
clear_df.head()

Unnamed: 0_level_0,Borough,Neighbourhood
PostalCode,Unnamed: 1_level_1,Unnamed: 2_level_1
M3A,False,False
M4A,False,False
M5A,False,False
M6A,False,False
M7A,False,False


Loading the data to the <code>clear_df</code> DataFrame.

In [666]:
df['Neighbourhood'] = df.Neighbourhood.apply(lambda s: s[0])
for pcd, b, _ in df.values:
    clear_df.loc[pcd, 'Borough'] = b
for pcd, nb in pc.values:
    clear_df.loc[pcd, 'Neighbourhood'] = nb
clear_df.reset_index().head(12)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
5,M9A,Queen's Park,Queen's Park
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens, Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson, Garden District"
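
The list wrapping, grouping and manual loading above can be sketched as a single `groupby` with `agg`. This is only an alternative, under the assumption that each postal code belongs to exactly one borough; the `sample` frame is hypothetical:

```python
import pandas as pd

# Hypothetical rows in the same long format as df.
sample = pd.DataFrame({
    "PostalCode": ["M3A", "M6A", "M6A", "M1B", "M1B"],
    "Borough": ["North York", "North York", "North York", "Scarborough", "Scarborough"],
    "Neighbourhood": ["Parkwoods", "Lawrence Heights", "Lawrence Manor", "Rouge", "Malvern"],
})

# sort=False keeps the postal codes in order of first appearance.
combined = (
    sample.groupby("PostalCode", sort=False)
    .agg({"Borough": "first", "Neighbourhood": ", ".join})
    .reset_index()
)
print(combined)
```

Taking `"first"` for the borough is safe only while the one-borough assumption holds, which is what the next check looks at.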


Let's make sure that, in the original DataFrame <code>df</code>, no postal code corresponds to two or more different boroughs.

In [667]:
test = clear_df.Borough.apply(lambda l: l.split(','))
test = test.apply(lambda l: len(l))
print('Number of distinct boroughs corresponding to each postal code not greater than {}.'.format(test.max()))

Number of distinct boroughs corresponding to each postal code not greater than 1.
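
The check above counts comma-separated parts in the <code>Borough</code> column of <code>clear_df</code>, where each postal code already holds a single value because every assignment overwrites the previous one. A more direct sketch counts the distinct boroughs per postal code in the long table with `nunique`; the `sample` frame is hypothetical:

```python
import pandas as pd

# Hypothetical rows in the same long format as df.
sample = pd.DataFrame({
    "PostalCode": ["M6A", "M6A", "M1B"],
    "Borough": ["North York", "North York", "Scarborough"],
})

# Number of distinct boroughs for each postal code.
boroughs_per_code = sample.groupby("PostalCode")["Borough"].nunique()
print(boroughs_per_code.max())
```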


In [668]:
p_codes = ['M5G', 'M2H', 'M4B', 'M1J', 'M4G', 'M4M', 'M1R', 'M9V', 'M9L', 'M5V', 'M1B', 'M5A']
clear_df.loc[p_codes, :].reset_index()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M5G,Downtown Toronto,Central Bay Street
1,M2H,North York,Hillcrest Village
2,M4B,East York,"Woodbine Gardens, Parkview Hill"
3,M1J,Scarborough,Scarborough Village
4,M4G,East York,Leaside
5,M4M,East Toronto,Studio District
6,M1R,Scarborough,"Maryvale, Wexford"
7,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."
8,M9L,North York,Humber Summit
9,M5V,Downtown Toronto,"CN Tower, Bathurst Quay, Island airport, Harbo..."


It seems I made a mistake: 'M5A' must have two neighborhoods, 'Regent Park' and 'Harbourfront', but there is only 'Harbourfront'. How many rows in <code>df</code> contain the postal code 'M5A'?

In [669]:
df[df['PostalCode'] == 'M5A']

Unnamed: 0,PostalCode,Borough,Neighbourhood
2,M5A,Downtown Toronto,Harbourfront


Only one row! It is not my mistake. Let's double-check against the data scraped from the [Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M "List of postal codes of Canada: M").

In [670]:
len(df[df['Neighbourhood'] == 'Regent Park'])

0

There are 0 rows with 'Regent Park' as a neighborhood.

* __In the last cell of your notebook, use the <code>.shape</code> method to print the number of rows of your dataframe.__

In [671]:
clear_df.shape

(103, 2)

### Call to Get the Coordinates

I couldn't install the <code>geocode</code> package with either the <code>conda</code> or the <code>pip</code> package manager. I found another solution, the <code>pgeocode</code> package, which I could install only with <code>pip</code>. It works better than <code>geocode</code> because the response is always valid.

Creating the coordinate reaper.

In [672]:
try:
    import pgeocode
except ModuleNotFoundError:
        ! pip install pgeocode
finally:
    try:
        import pgeocode
    except ModuleNotFoundError:
        print('pgeocode isn\'t installed.')

    

nomi = pgeocode.Nominatim('CA') #Country code of Canada is 'CA'

According to the [project description](https://pypi.org/project/pgeocode/) of <code>pgeocode</code>, section __Quickstart__, prepare a list of
postal codes from the <span style="color:red">'PostalCode'</span> column of <code>clear_df</code>. From the section __Geocoding format__ of the docs, we know that the result of a geo-localization query is a <code>pandas.DataFrame</code>.

In [673]:
#Prepare list of the postal codes.
clear_df.reset_index(inplace=True)
postal_codes = list(clear_df['PostalCode'])

#Call to get a data.
geodata = nomi.query_postal_code(postal_codes)
geodata.head()

Unnamed: 0,postal_code,country code,place_name,state_name,state_code,county_name,county_code,community_name,community_code,latitude,longitude,accuracy
0,M3A,CA,North York (York Heights / Victoria Village / ...,Ontario,ON,North York,,,,43.7545,-79.33,1.0
1,M4A,CA,North York (Sweeney Park / Wigmore Park),Ontario,ON,,,,,43.7276,-79.3148,6.0
2,M5A,CA,Downtown Toronto (Regent Park / Port of Toronto),Ontario,ON,Toronto,8133394.0,,,43.6555,-79.3626,6.0
3,M6A,CA,North York (Lawrence Manor / Lawrence Heights),Ontario,ON,North York,,,,43.7223,-79.4504,6.0
4,M7A,CA,Queen's Park Ontario Provincial Government,Ontario,ON,,,,,43.6641,-79.3889,


Since we need only the latitude and longitude data, there is no need to care about the missing entries of the other features. There are 103 rows in this DataFrame. As the summary below shows, there is one missing value in each of the <span style="color:red">'latitude'</span> and <span style="color:red">'longitude'</span> columns.

In [674]:
geodata.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103 entries, 0 to 102
Data columns (total 12 columns):
postal_code       103 non-null object
country code      102 non-null object
place_name        102 non-null object
state_name        102 non-null object
state_code        102 non-null object
county_name       98 non-null object
county_code       41 non-null float64
community_name    0 non-null object
community_code    0 non-null float64
latitude          102 non-null float64
longitude         102 non-null float64
accuracy          96 non-null float64
dtypes: float64(5), object(7)
memory usage: 10.5+ KB


The columns might have their missing values in different rows, so check each column with its own condition and combine the conditions with logical *OR*.

In [675]:
import numpy as np



geodata.fillna(value='Not assigned', inplace=True)
#Combine two conditions with the logical OR operator, written here as the | symbol;
#each condition goes into its own brackets.
geodata.loc[(geodata.latitude == 'Not assigned') | (geodata.longitude == 'Not assigned'), :]

Unnamed: 0,postal_code,country code,place_name,state_name,state_code,county_name,county_code,community_name,community_code,latitude,longitude,accuracy
76,M7R,Not assigned,Not assigned,Not assigned,Not assigned,Not assigned,Not assigned,Not assigned,Not assigned,Not assigned,Not assigned,Not assigned


It's a single row, row 76. Keep only the rows with no missing data.

In [676]:
geodata = geodata.loc[(geodata.latitude != 'Not assigned') & (geodata.longitude != 'Not assigned'), :] #Both coordinates must be present.
geodata.shape

(102, 12)
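
Filling missing values with a string turns the numeric columns into `object` columns. A sketch of an alternative that keeps them numeric, using `isna` to inspect and `dropna` with `subset` to filter; the `sample` frame is hypothetical, with coordinates copied from the output above:

```python
import numpy as np
import pandas as pd

# Hypothetical query result with one postal code that has no coordinates.
sample = pd.DataFrame({
    "postal_code": ["M3A", "M7R", "M4A"],
    "latitude": [43.7545, np.nan, 43.7276],
    "longitude": [-79.33, np.nan, -79.3148],
})

# Rows where at least one coordinate is missing (logical OR).
missing = sample[sample["latitude"].isna() | sample["longitude"].isna()]
print(missing["postal_code"].tolist())

# Keep only the rows where both coordinates are present.
complete = sample.dropna(subset=["latitude", "longitude"])
print(complete.shape)
```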

In [677]:
latlon = geodata[['postal_code', 'latitude', 'longitude']]
latlon.head()

Unnamed: 0,postal_code,latitude,longitude
0,M3A,43.7545,-79.33
1,M4A,43.7276,-79.3148
2,M5A,43.6555,-79.3626
3,M6A,43.7223,-79.4504
4,M7A,43.6641,-79.3889


Joining two DataFrames by hand through a third DataFrame is too boring. Let's use the <code>.join</code> method provided by the <code>pandas</code> package.

In [678]:
clear_df = clear_df.set_index('PostalCode').join(latlon.set_index('postal_code'))
clear_df.reset_index(inplace=True)
clear_df.head(12)

Unnamed: 0,PostalCode,Borough,Neighbourhood,latitude,longitude
0,M3A,North York,Parkwoods,43.7545,-79.33
1,M4A,North York,Victoria Village,43.7276,-79.3148
2,M5A,Downtown Toronto,Harbourfront,43.6555,-79.3626
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.7223,-79.4504
4,M7A,Downtown Toronto,Queen's Park,43.6641,-79.3889
5,M9A,Queen's Park,Queen's Park,43.6662,-79.5282
6,M1B,Scarborough,"Rouge, Malvern",43.8113,-79.193
7,M3B,North York,Don Mills North,43.745,-79.359
8,M4B,East York,"Woodbine Gardens, Parkview Hill",43.7063,-79.3094
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.6572,-79.3783


Drop the rows without coordinates, then show the same slice of the DataFrame as before.

In [679]:
lat_mask = clear_df.latitude.notna() #notna() is more reliable than an identity check against np.nan.
lon_mask = clear_df.longitude.notna()
clear_df = clear_df[lat_mask & lon_mask] #Keep rows where both coordinates are present.
clear_df.reset_index(inplace=True)

In [680]:
p_codes = ['M5G', 'M2H', 'M4B', 'M1J', 'M4G', 'M4M', 'M1R', 'M9V', 'M9L', 'M5V', 'M1B', 'M5A']
clear_df.set_index('PostalCode').loc[p_codes, :]

Unnamed: 0_level_0,index,Borough,Neighbourhood,latitude,longitude
PostalCode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
M5G,24,Downtown Toronto,Central Bay Street,43.6564,-79.386
M2H,27,North York,Hillcrest Village,43.8015,-79.3577
M4B,8,East York,"Woodbine Gardens, Parkview Hill",43.7063,-79.3094
M1J,32,Scarborough,Scarborough Village,43.7464,-79.2323
M4G,23,East York,Leaside,43.7124,-79.3644
M4M,54,East Toronto,Studio District,43.6561,-79.3406
M1R,71,Scarborough,"Maryvale, Wexford",43.7507,-79.3003
M9V,89,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ...",43.7432,-79.5876
M9L,50,North York,Humber Summit,43.7598,-79.5565
M5V,87,Downtown Toronto,"CN Tower, Bathurst Quay, Island airport, Harbo...",43.6404,-79.3995


### Visualizing Neighborhoods on the Map

Importing <code>folium</code> and <code>geopandas</code> packages.

In [681]:
response = []

try:
    import folium
except ModuleNotFoundError:
        !conda install -c conda-forge folium=0.5.0 --yes
finally:
    try:
        import folium
    except ModuleNotFoundError:
        response.append('folium isn\'t installed.')

try:
    import geopandas as gpd
except ModuleNotFoundError:
        !conda install geopandas
finally:
    try:
        import geopandas as gpd
    except ModuleNotFoundError:
        response.append('geopandas isn\'t installed.')

        

if response:
    print(response)
else:
    print('OK')

OK


Display the map of Canada.

In [682]:
#Set Canada latitude and longitude values.
latitude = 55.585901
longitude = -105.750596

#Create map and display it.
canada_map = folium.Map(location=[latitude, longitude], zoom_start=12)

#Display the map of Canada.
canada_map

But we are interested in the part centered on the mean of the <code>latitude</code> and <code>longitude</code> values from our <code>clear_df</code> DataFrame.

In [683]:
mean = clear_df[['latitude', 'longitude']].mean().round(4)
latitude = mean['latitude']
longitude = mean['longitude']

print(latitude, longitude)

43.7067 -79.394


In [684]:
canada_map = folium.Map(location=[latitude, longitude], zoom_start=11)
canada_map

We need data on the borders of Canadian neighborhoods. Following [this explanation](https://fooobar.com/questions/291160/howwhere-do-i-get-geojson-data-for-states-provinces-and-administrative-regions-of-non-us-countries "Mining GeoData"), I got it and converted it to a <code>.geojson</code> file with the third level of detail.

In [685]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0,AREA_S_CD,AREA_NAME,geometry
0,97,Yonge-St.Clair (97),"POLYGON ((-79.39119 43.68108, -79.39141 43.680..."
1,27,York University Heights (27),"POLYGON ((-79.50529 43.75987, -79.50488 43.759..."
2,38,Lansing-Westgate (38),"POLYGON ((-79.43998 43.76156, -79.44043 43.763..."
3,31,Yorkdale-Glen Park (31),"POLYGON ((-79.43969 43.70561, -79.44011 43.705..."
4,16,Stonegate-Queensway (16),"POLYGON ((-79.49262 43.64744, -79.49277 43.647..."


In [686]:
#Clean names in 'AREA_NAME' column.
import re



pattern = r"[a-zA-Z'\-\s\.]+[^\ \(]" #Raw string, so the backslashes reach the regex engine unchanged.
geo_json['AREA_NAME'] = geo_json['AREA_NAME'].apply(lambda name: re.match(pattern, name).group())
geo_json.head()

Unnamed: 0,AREA_S_CD,AREA_NAME,geometry
0,97,Yonge-St.Clair,"POLYGON ((-79.39119 43.68108, -79.39141 43.680..."
1,27,York University Heights,"POLYGON ((-79.50529 43.75987, -79.50488 43.759..."
2,38,Lansing-Westgate,"POLYGON ((-79.43998 43.76156, -79.44043 43.763..."
3,31,Yorkdale-Glen Park,"POLYGON ((-79.43969 43.70561, -79.44011 43.705..."
4,16,Stonegate-Queensway,"POLYGON ((-79.49262 43.64744, -79.49277 43.647..."
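
The pattern above matches the name part and stops before the bracket. A sketch of an alternative that removes the trailing number instead, using `re.sub`; it assumes every name ends with a space and a number in parentheses, as in the rows shown:

```python
import re

# Sample names copied from the AREA_NAME column shown above.
names = ["Yonge-St.Clair (97)", "York University Heights (27)"]

# Remove optional whitespace followed by a parenthesised number at the end.
suffix = re.compile(r"\s*\(\d+\)$")
cleaned = [suffix.sub("", name) for name in names]
print(cleaned)
```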


In [687]:
try:
    from shapely.geometry import Point, Polygon
except ModuleNotFoundError:
    !conda install shapely
finally:
    try:
        from shapely.geometry import Point, Polygon
    except ModuleNotFoundError:
        response.append('shapely isn\'t installed.')
        

Since I couldn't find a <code>.geojson</code> file representing the boroughs and neighborhoods that are in the <code>clear_df</code> DataFrame, I will show another segmentation of Toronto and color each part according to how many postal codes it has.

In [688]:
def loader(geo_df):
    
    def distributor(point):
        for name, polygon in geo_df:
            flag = point.within(polygon)
            if flag:
                return name
                break
        return np.nan
    
    return distributor



geo_df = zip(geo_json['AREA_NAME'], geo_json['geometry'])
distributor = loader(geo_df)

latlon = zip(clear_df['latitude'], clear_df['longitude'])
points = map(lambda crd: Point(crd[1], crd[0]), latlon)
points = list(points)

for point in points: #Not every postal code location is located in an area from geo_df.
    place = distributor(point)
    if place:
        print(place)
        break

Parkwoods-Donalda


Only one match?! Uhh, face-palm.
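
The single printed name may come from the loop rather than from the data. The loop stops at its first `break`, and `np.nan` is truthy, so `if place:` passes even when nothing matched. In addition, a `zip` object is an iterator, so areas already scanned by one call to `distributor` are not available to the next call. The sketch below shows the iterator effect in plain Python; the rectangles and the `contains` helper are hypothetical stand-ins for the polygons and `point.within`:

```python
# Hypothetical areas as (min_x, min_y, max_x, max_y) rectangles.
names = ["West", "East"]
boxes = [(0, 0, 1, 1), (1, 0, 2, 1)]

def contains(box, point):
    """Return True if the point lies inside the rectangle (hypothetical helper)."""
    min_x, min_y, max_x, max_y = box
    x, y = point
    return min_x <= x < max_x and min_y <= y < max_y

def locate(areas, point):
    """Return the name of the first area containing the point, else None."""
    for name, box in areas:
        if contains(box, point):
            return name
    return None

points = [(1.5, 0.5), (0.5, 0.5)]

# The first lookup consumes the zip iterator, so the second finds nothing.
areas_iter = zip(names, boxes)
from_iterator = [locate(areas_iter, p) for p in points]

# A list can be scanned again for every point.
areas_list = list(zip(names, boxes))
from_list = [locate(areas_list, p) for p in points]

print(from_iterator)
print(from_list)
```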

I am totally disappointed with my fruitless search for a sufficiently detailed <code>.geojson</code>. I am very tired. I went through the whole of Google and didn't find it. Let's just mark the postal code points on the map.

In [689]:
# instantiate a feature group for the postal codes in the dataframe
postal_codes = folium.map.FeatureGroup()

# loop through the postal code locations and add each to the feature group
for lat, lng in zip(clear_df.latitude, clear_df.longitude):
    postal_codes.add_child(
        folium.features.CircleMarker(
            [lat, lng],
            radius=5, # define how big you want the circle markers to be
            color='yellow',
            fill=True,
            fill_color='blue',
            fill_opacity=0.6
        )
    )

# add pop-up text to each marker on the map
latitudes = list(clear_df.latitude)
longitudes = list(clear_df.longitude)
labels = list(clear_df.PostalCode)

for lat, lng, label in zip(latitudes, longitudes, labels):
    folium.Marker([lat, lng], popup=label).add_to(canada_map)    
    
# add postal codes to map
canada_map.add_child(postal_codes)

I spent approximately three days finishing this assignment. It was really hard and interesting, and I felt like a real data scientist. I recalled almost everything I know about Python packages, from <code>pandas</code> to <code>re</code>. I worked on this assignment for eight to ten hours a day and did nothing else. Maybe that is too long, and some data scientists could do it in a couple of hours, but I solved it as fast as I could.