# Segmenting and Clustering Neighborhoods in Toronto
In this assignment, we will be exploring, segmenting, and clustering the neighborhoods in the city of Toronto.

## Part 1: Scraping data from Wikipedia webpage

First, we install and import the required libraries.

In [None]:
import sys
!conda install --yes --prefix {sys.prefix} beautifulsoup4
!conda install --yes --prefix {sys.prefix} lxml
!conda install --yes --prefix {sys.prefix} html5lib
!conda install --yes --prefix {sys.prefix} requests
!conda install --yes --prefix {sys.prefix} -c conda-forge folium=0.5.0
!conda install --yes --prefix {sys.prefix} -c conda-forge geopy
print("everything is installed now...")


In [21]:
from bs4 import BeautifulSoup
import requests
import lxml
import pandas as pd
import numpy as np
import folium
from geopy.geocoders import Nominatim 
import matplotlib.cm as cm
import matplotlib.colors as colors

# to enable autocomplete in the notebook
%config IPCompleter.greedy=True 

Getting the webpage source from Wikipedia

In [22]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_page = requests.get(url)

Parsing the HTML content of the webpage using BeautifulSoup <br> 
Then extract the `table` tag that contains the neighbourhood data

In [23]:
soup = BeautifulSoup(html_page.content,'lxml')
table = soup.table
# print(table.prettify())

Extract table headers from the table and put them in a list

In [24]:
headers = table.find_all('th')
headers_list = []
for x in headers:
    headers_list.append(x.text)
headers_list[2] = headers_list[2].replace('\n','')
print(headers_list)

['Postcode', 'Borough', 'Neighbourhood']


Extract the rows from the table (without the headers) and put them in a dataframe: <br>
- First, we convert the table to a list in which every element is one table row
- Then we extract the cells of each row one by one, convert them to text, and append the row to an initially empty list
- Finally, we convert that list of rows to a dataframe

In [25]:
content = table.find_all('tr')
del content[0]

# initializing list of neighbourhoods
l = []

# put neighbourhoods in the list one by one (loop over the extracted list that contains rows in it)
for tr in content:
    # convert the extracted row to a list that contains elements of the table as an element in the list
    row = tr.find_all('td')
#     print(row)
    # convert every element in the previous list to the text content (removing the tags from it)
    tmp_lst = [elem.text for elem in row]
    # the next line is to remove the \n (newline) from the last element of the list
    tmp_lst[2] = tmp_lst[2].replace('\n','')
    print(tmp_lst)
    # appending the list to the list of the lists
    l.append(tmp_lst)
#     print(l)
    
df_nbrs = pd.DataFrame(l,columns=headers_list)
print(df_nbrs.shape)
df_nbrs.head(30)

['M1A', 'Not assigned', 'Not assigned']
['M2A', 'Not assigned', 'Not assigned']
['M3A', 'North York', 'Parkwoods']
['M4A', 'North York', 'Victoria Village']
['M5A', 'Downtown Toronto', 'Harbourfront']
['M5A', 'Downtown Toronto', 'Regent Park']
['M6A', 'North York', 'Lawrence Heights']
['M6A', 'North York', 'Lawrence Manor']
['M7A', "Queen's Park", 'Not assigned']
['M8A', 'Not assigned', 'Not assigned']
['M9A', 'Etobicoke', 'Islington Avenue']
['M1B', 'Scarborough', 'Rouge']
['M1B', 'Scarborough', 'Malvern']
['M2B', 'Not assigned', 'Not assigned']
['M3B', 'North York', 'Don Mills North']
['M4B', 'East York', 'Woodbine Gardens']
['M4B', 'East York', 'Parkview Hill']
['M5B', 'Downtown Toronto', 'Ryerson']
['M5B', 'Downtown Toronto', 'Garden District']
['M6B', 'North York', 'Glencairn']
['M7B', 'Not assigned', 'Not assigned']
['M8B', 'Not assigned', 'Not assigned']
['M9B', 'Etobicoke', 'Cloverdale']
['M9B', 'Etobicoke', 'Islington']
['M9B', 'Etobicoke', 'Martin Grove']
['M9B', 'Etobicoke',

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned
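
The row-extraction steps above can be tried without a network connection on a small hand-written table. This is only a sketch: the HTML snippet is a made-up stand-in for the Wikipedia table, it uses Python's built-in `html.parser` backend instead of `lxml`, and `get_text(strip=True)` replaces the manual `replace('\n','')` step.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hand-written stand-in for the Wikipedia table (assumed layout: one header row, three columns).
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.table

# Header cells are <th>, data cells are <td>; strip=True removes the trailing newline.
header = [th.get_text(strip=True) for th in table.find_all("th")]
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")
]
# The header row contains no <td> cells, so it produces an empty list; drop it.
rows = [r for r in rows if r]

df_demo = pd.DataFrame(rows, columns=header)
print(df_demo)
```

Filtering out empty rows avoids deleting the first `tr` by position, which helps if the header row is ever missing or repeated.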


Drop rows whose borough is <b>"Not assigned"</b>

In [26]:
indexNames = df_nbrs[ df_nbrs['Borough'] == "Not assigned" ].index
df_nbrs.drop(indexNames , inplace=True)
# renumber the rows from 0 without keeping the old index as a column
df_nbrs.reset_index(drop=True, inplace=True)
print(df_nbrs.shape)
df_nbrs.head(30)

(211, 3)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Not assigned
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern
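
The same filtering can also be written as a single boolean-mask expression. The sketch below runs on a three-row toy dataframe (the `demo` and `kept` names are illustrative only) and should behave like the drop-and-reset steps above.

```python
import pandas as pd

# Toy data with the same columns as df_nbrs.
demo = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Keep only the rows whose borough is assigned, then renumber from 0.
kept = demo[demo["Borough"] != "Not assigned"].reset_index(drop=True)
print(kept)
```

Note that the M7A row survives: only the Borough column is tested here, and its unassigned neighbourhood is handled in a later step.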


In the next cell, I am grouping the dataframe by the Postcode column to join the Neighbourhoods that have the same Postcode

In [27]:
grouped = df_nbrs.groupby("Postcode").agg([','.join])
final_df = grouped.reset_index().droplevel(1,axis=1)
final_df.head(20)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,"Scarborough,Scarborough","Rouge,Malvern"
1,M1C,"Scarborough,Scarborough,Scarborough","Highland Creek,Rouge Hill,Port Union"
2,M1E,"Scarborough,Scarborough,Scarborough","Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,"Scarborough,Scarborough,Scarborough","East Birchmount Park,Ionview,Kennedy Park"
7,M1L,"Scarborough,Scarborough,Scarborough","Clairlea,Golden Mile,Oakridge"
8,M1M,"Scarborough,Scarborough,Scarborough","Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,"Scarborough,Scarborough","Birch Cliff,Cliffside West"


In the next cell, I remove the repeated borough names in each row of the Borough column of the previous result

In [28]:
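# every joined Borough string repeats the same name, so keep only the first one
# (assigning the whole column avoids chained assignment through .iloc)
final_df["Borough"] = final_df["Borough"].str.split(',').str[0]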
final_df.head(20)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff,Cliffside West"
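
As a possible alternative, the grouping and the borough de-duplication can be done in one pass with pandas named aggregation. This sketch assumes each postcode belongs to exactly one borough, so taking the first borough of each group loses nothing; `demo` and `grouped_demo` are illustrative names on toy data.

```python
import pandas as pd

# Toy data: M5A appears twice with the same borough, as in the scraped table.
demo = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# One row per postcode: first borough of the group, neighbourhoods joined by commas.
grouped_demo = demo.groupby("Postcode", as_index=False).agg(
    Borough=("Borough", "first"),
    Neighbourhood=("Neighbourhood", ",".join),
)
print(grouped_demo)
```

Because each output column gets its own aggregation, the result has flat column names and needs no `droplevel` call.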


Handling cases where the Neighbourhood column is "Not assigned": the neighbourhood takes the name of the borough

In [29]:
# where no neighbourhood is assigned, fall back to the borough name
mask = final_df["Neighbourhood"] == "Not assigned"
final_df.loc[mask, "Neighbourhood"] = final_df.loc[mask, "Borough"]
# showing the example mentioned (9th row in wikipedia page)
final_df[final_df["Postcode"]=="M7A"]

Unnamed: 0,Postcode,Borough,Neighbourhood
85,M7A,Queen's Park,Queen's Park


Showing that there are no rows with a "Not assigned" Borough

In [30]:
final_df[final_df["Borough"]=="Not assigned"]

Unnamed: 0,Postcode,Borough,Neighbourhood


Showing that there are no duplicates in the Postcode column, meaning that all Neighbourhoods with the same Postcode were combined into one row

In [31]:
final_df[final_df["Postcode"].duplicated()]

Unnamed: 0,Postcode,Borough,Neighbourhood


Showing that no row has a Neighbourhood equal to "Not assigned"

In [32]:
final_df[final_df["Neighbourhood"]=="Not assigned"]

Unnamed: 0,Postcode,Borough,Neighbourhood


In [33]:
final_df.shape

(103, 3)
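
The visual checks above can also be folded into assertions that fail loudly if the cleaning ever regresses. This is a minimal sketch: `check_clean` is a hypothetical helper name, and the two small dataframes are toy data rather than the scraped table.

```python
import pandas as pd

def check_clean(df):
    """Raise AssertionError if any of the three cleaning invariants is violated."""
    assert not (df["Borough"] == "Not assigned").any(), "unassigned borough found"
    assert not (df["Neighbourhood"] == "Not assigned").any(), "unassigned neighbourhood found"
    assert df["Postcode"].is_unique, "duplicate postcode found"
    return True

# A frame that satisfies all three invariants.
good = pd.DataFrame({
    "Postcode": ["M3A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Parkwoods", "Queen's Park"],
})

# A frame with a duplicated postcode, which should be rejected.
bad = pd.DataFrame({
    "Postcode": ["M3A", "M3A"],
    "Borough": ["North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Parkwoods"],
})

print(check_clean(good))
```

In the notebook, the call would be `check_clean(final_df)`.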

## Part 2: Adding coordinates to the neighbourhoods


Reading the CSV file that contains the coordinates of the neighbourhoods

In [34]:
coord_df = pd.read_csv("https://cocl.us/Geospatial_data")
coord_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Rename the "Postal Code" column to "Postcode" so that we can merge the coordinates data with the dataframe that contains the neighbourhoods extracted in Part 1

In [35]:
coord_df.rename(columns={'Postal Code':'Postcode'}, inplace=True)
coord_df.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Merging the coordinates into the dataframe that contains the neighbourhoods extracted in Part 1

In [36]:
# both dataframes share the key column name, so a single `on` is enough
combined_df = pd.merge(final_df, coord_df, how='left', on='Postcode')
combined_df.head()


Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


The next cell checks that the merge was done correctly by testing one of the examples mentioned in the assignment

In [37]:
combined_df[combined_df["Postcode"]=="M5A"]

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
53,M5A,Downtown Toronto,"Harbourfront,Regent Park",43.65426,-79.360636
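
Beyond spot-checking a single postcode, `pd.merge` can report unmatched keys itself through its `validate` and `indicator` parameters. The sketch below uses toy data; "M9Z" is a made-up postcode included only to show what a missing coordinate looks like.

```python
import pandas as pd

# Toy neighbourhood table; "M9Z" is a hypothetical postcode with no coordinates.
left = pd.DataFrame({
    "Postcode": ["M1B", "M5A", "M9Z"],
    "Borough": ["Scarborough", "Downtown Toronto", "Hypothetical"],
})
right = pd.DataFrame({
    "Postcode": ["M1B", "M5A"],
    "Latitude": [43.806686, 43.65426],
    "Longitude": [-79.194353, -79.360636],
})

# validate raises MergeError if a postcode repeats on either side;
# indicator adds a _merge column saying where each row came from.
merged = pd.merge(left, right, how="left", on="Postcode",
                  validate="one_to_one", indicator=True)

# Rows found only in the left table have no coordinates.
missing = merged.loc[merged["_merge"] == "left_only", "Postcode"].tolist()
print(missing)
```

An empty `missing` list on the real data would indicate that every postcode received a latitude and longitude.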


## Part 3: Explore and Cluster the data

Preview the combined dataframe that contains neighbourhood data with coordinates data 

In [38]:
combined_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


Get the geographical coordinates of Toronto

In [39]:
address = 'Toronto'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.
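
geopy's `geocode` returns `None` when it finds no match, in which case reading `location.latitude` raises an `AttributeError`. A small guard, sketched below, makes that case explicit. `coords_or_default` and `fake_geocode` are hypothetical names, and the stub stands in for `geolocator.geocode` so the example runs offline.

```python
from collections import namedtuple

# Minimal stand-in for geopy's Location object (only the two attributes used here).
Location = namedtuple("Location", ["latitude", "longitude"])

def coords_or_default(geocode, address, default):
    """Return (lat, lng) from geocode(address), or default when nothing is found."""
    location = geocode(address)
    if location is None:
        return default
    return (location.latitude, location.longitude)

def fake_geocode(address):
    """Offline stub: knows only 'Toronto' and returns None for anything else."""
    known = {"Toronto": Location(43.653963, -79.387207)}
    return known.get(address)

toronto = coords_or_default(fake_geocode, "Toronto", default=(0.0, 0.0))
unknown = coords_or_default(fake_geocode, "Nowhere", default=(0.0, 0.0))
print(toronto, unknown)
```

With the real geocoder, the call would be `coords_or_default(geolocator.geocode, 'Toronto', default)`, with `default` chosen by the caller.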


Display the map of Toronto with the neighbourhoods circled on it <br>
Neighbourhoods in the same borough have the same color

In [40]:
map_Toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

unique_borough = combined_df['Borough'].unique()
# one distinct color per borough, sampled evenly from the rainbow colormap
colors_array = cm.rainbow(np.linspace(0, 1, len(unique_borough)))
rainbow = [colors.rgb2hex(c) for c in colors_array]
# look-up table from borough name to its color
borough_color = dict(zip(unique_borough, rainbow))

# add markers to map
for lat, lng, borough, neighborhood in zip(combined_df['Latitude'], combined_df['Longitude'], combined_df['Borough'], combined_df['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=borough_color[borough],
        fill=True,
        fill_color=borough_color[borough],
        fill_opacity=0.7).add_to(map_Toronto)
    
map_Toronto