# Segmentation and Clustering Neighborhoods in Toronto

First, we import the libraries we need and extract the postal code table from the Wikipedia article. Here are the first rows.

In [1]:
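# Install folium before importing it (the leading "!" runs a shell command from the notebook).
!conda install -c conda-forge folium=0.5.0 --yes
import folium
import pandas as pd
from geopy.geocoders import Nominatim

# The first table on the page is the list of postal codes.
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")[0]
df.to_csv('beautifulsoup_pandas.csv')  # keep a local copy of the raw table

# The header was read as the first data row: drop it and name the columns.
df = df[1:]
df.rename(columns={0:'Postcode', 1:'Borough', 2:'Neighborhood'}, inplace=True)

df.head()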




Unnamed: 0,Postcode,Borough,Neighborhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront


Let's see how big our data set is.

In [2]:
df.shape

(289, 3)

We create a new empty data frame where we will store the clean data.

In [3]:
new_df = pd.DataFrame(columns = df.columns)
new_df.head()

Unnamed: 0,Postcode,Borough,Neighborhood


We loop over the original data set to clean the data and add it to our new data frame.

In [4]:
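for index, element in df.iterrows():
    # Skip rows whose borough is not assigned.
    if element['Borough'] != 'Not assigned':
        if element['Postcode'] in new_df['Postcode'].unique():
            # The postcode is already stored: append the neighborhood to its row.
            # Write through .loc, because rows yielded by iterrows() may be copies.
            for index_, element_ in new_df.iterrows():
                if (element_['Postcode'] == element['Postcode']) and (element['Neighborhood'] not in element_['Neighborhood']):
                    new_df.loc[index_, 'Neighborhood'] = element_['Neighborhood'] + ', ' + element['Neighborhood']
        else:
            new_df.loc[len(new_df)] = element

# A neighborhood that is not assigned takes the name of its borough.
for index, element in new_df.iterrows():
    if element['Neighborhood'] == 'Not assigned':
        new_df.loc[index, 'Neighborhood'] = element['Borough']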

Let's see the first rows of our new data frame.

In [5]:
new_df.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Queen's Park


Let's take a look at the size of our new data frame.

In [6]:
new_df.shape

(103, 3)
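
The same three cleaning rules (drop unassigned boroughs, fall back to the borough name, join neighborhoods that share a postcode) can also be sketched without explicit loops. This is a minimal illustration on a hand-made sample, not the notebook's method. The names `raw`, `assigned` and `clean` are hypothetical, and unlike the loop it does not skip a neighborhood name that already appears in the joined string.

```python
import pandas as pd

# Small hand-made sample in the same shape as the scraped table (illustrative rows only).
raw = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'North York', 'Downtown Toronto',
                'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Harbourfront',
                     'Regent Park', 'Not assigned'],
})

# 1. Drop rows whose borough is not assigned.
assigned = raw[raw['Borough'] != 'Not assigned'].copy()

# 2. A neighborhood that is not assigned takes the name of its borough.
missing = assigned['Neighborhood'] == 'Not assigned'
assigned.loc[missing, 'Neighborhood'] = assigned.loc[missing, 'Borough']

# 3. Join the neighborhoods that share a postcode, keeping first-seen order.
clean = (
    assigned.groupby(['Postcode', 'Borough'], sort=False)['Neighborhood']
    .agg(', '.join)
    .reset_index()
)
print(clean)
```

On the real table, `raw` would be replaced by `df`; comparing the resulting shape with the `(103, 3)` above is a quick sanity check.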

We decided to use the CSV file of geospatial data to get the latitudes and longitudes. Let's load it into a pandas data frame.

In [7]:
df_ = pd.read_csv('https://cocl.us/Geospatial_data')
df_.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


We add the coordinates to our data frame and print the first rows of the final data frame.

In [8]:
df_copy = pd.DataFrame(columns = ['Latitude', 'Longitude'])
for index, element in new_df.iterrows():
    for index_, element_ in df_.iterrows():
        if element['Postcode'] == element_['Postal Code']:
             df_copy.loc[len(df_copy)] = [element_['Latitude'], element_['Longitude']]
new_df['Latitude'] = df_copy['Latitude']
new_df['Longitude'] = df_copy['Longitude']
new_df.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
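
The nested loop above appends a coordinate pair only when a postcode matches, so a postcode missing from the CSV file would shift every later coordinate up by one row. A keyed join avoids that. The sketch below uses `DataFrame.merge` on two small stand-in frames. The names `neighborhoods`, `coords` and `merged` are hypothetical, and `M9Z` is a made-up postcode used only to show what happens when there is no match.

```python
import pandas as pd

# Illustrative stand-ins for new_df and df_ (coordinates copied from the outputs above).
neighborhoods = pd.DataFrame({
    'Postcode': ['M3A', 'M4A', 'M9Z'],  # 'M9Z' is made up and has no coordinates
    'Borough': ['North York', 'North York', 'Nowhere'],
    'Neighborhood': ['Parkwoods', 'Victoria Village', 'Nowhere'],
})
coords = pd.DataFrame({
    'Postal Code': ['M4A', 'M3A'],  # deliberately in a different order
    'Latitude': [43.725882, 43.753259],
    'Longitude': [-79.315572, -79.329656],
})

# A left join keeps every neighborhood row, in order, and matches on the postcode.
merged = (
    neighborhoods
    .merge(coords, how='left', left_on='Postcode', right_on='Postal Code')
    .drop(columns='Postal Code')
)
print(merged)
```

Rows without a match keep `NaN` coordinates instead of silently borrowing those of another postcode, so they can be found with `merged['Latitude'].isna()`.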


We get the coordinates of Toronto.

In [9]:
address = 'Toronto'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.
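
geopy's `geocode` returns `None` when the service finds no match, in which case `location.latitude` raises an `AttributeError`. A small guard can be sketched as follows. The helper `geocode_or_fallback` and the offline stand-in `fake_geocode` are hypothetical names introduced for illustration; in the notebook one would pass `geolocator.geocode` instead of the stand-in.

```python
from types import SimpleNamespace
from typing import Callable, Tuple

Coordinates = Tuple[float, float]


def geocode_or_fallback(geocode: Callable, address: str,
                        fallback: Coordinates) -> Coordinates:
    """Return (latitude, longitude) for address, or fallback if nothing is found."""
    location = geocode(address)
    if location is None:
        return fallback
    return (location.latitude, location.longitude)


def fake_geocode(address):
    # Hypothetical stand-in for geolocator.geocode, so the sketch runs offline.
    if address == 'Toronto':
        return SimpleNamespace(latitude=43.653963, longitude=-79.387207)
    return None


TORONTO = (43.653963, -79.387207)  # the coordinates printed above
print(geocode_or_fallback(fake_geocode, 'Toronto', fallback=TORONTO))
print(geocode_or_fallback(fake_geocode, 'no such place', fallback=TORONTO))
```

The guard covers only the "no result" case; network errors and timeouts raised by the geocoding service would still need their own handling.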


We create a map of Toronto with neighborhoods superimposed on top.

In [10]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(new_df['Latitude'], new_df['Longitude'], new_df['Borough'], new_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

For illustration purposes, let's simplify the above map and segment and cluster only the boroughs that contain the word Toronto. So let's slice the original dataframe and create a new dataframe of the "Toronto" data.

In [11]:
toronto_data = pd.DataFrame(columns = ['Postcode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude'])

for index, element in new_df.iterrows():
    if 'Toronto' in element['Borough']:
        toronto_data.loc[len(toronto_data)] = element
toronto_data.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
1,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
2,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
3,M4E,East Toronto,The Beaches,43.676357,-79.293031
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
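
The row-by-row filter above can also be written as a boolean mask. The sketch below is one possible equivalent on a small sample; the names `sample`, `is_toronto` and `toronto_only` are hypothetical.

```python
import pandas as pd

# Illustrative rows in the same shape as new_df (coordinates omitted for brevity).
sample = pd.DataFrame({
    'Postcode': ['M3A', 'M5A', 'M4E', 'M7A'],
    'Borough': ['North York', 'Downtown Toronto', 'East Toronto', "Queen's Park"],
    'Neighborhood': ['Parkwoods', 'Harbourfront, Regent Park',
                     'The Beaches', "Queen's Park"],
})

# regex=False treats 'Toronto' as a plain substring, like the `in` test in the loop.
is_toronto = sample['Borough'].str.contains('Toronto', regex=False)

# reset_index(drop=True) renumbers the rows from 0, as the loop does.
toronto_only = sample[is_toronto].reset_index(drop=True)
print(toronto_only)
```

On the real data, `sample` would be `new_df`, and the result should match `toronto_data`.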


In [12]:
map_toronto_ = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, label in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto_)  
    
map_toronto_