# Segmenting and Clustering Neighborhoods in Toronto

The project loads the Wikipedia table of Canadian postal codes into a pandas DataFrame, then processes and cleans the data for clustering.

The clustering is carried out with k-means, and the clusters are plotted using the Folium library.

The boroughs whose names contain 'Toronto' are first plotted, then clustered and plotted again.

## 1- Getting and Cleaning Data

In [1]:
import pandas as pd
import numpy as np

#### Wikipedia tables can be parsed in one line of code with pandas!

In [2]:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]

In [3]:
df.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


The dataframe consists of three columns: Postal Code, Borough, and Neighbourhood.

In [4]:
df.shape

(180, 3)

Now, drop the rows whose borough is 'Not assigned'.

In [5]:
df = df[df['Borough'] != 'Not assigned']

In [6]:
df.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


In [7]:
df.shape

(103, 3)

The 77 rows with a Borough value of 'Not assigned' have been removed (180 - 103 = 77).

The filter leaves gaps in the index where rows were dropped, so let's reset the index.

In [8]:
df = df.reset_index(drop=True)

In the source table, neighbourhoods that share a postal code are already combined into one row, with the names separated by commas.
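If the table instead listed one neighbourhood per row, the same layout could be rebuilt with a `groupby`. The following is only a sketch on hand-written sample rows (the `raw` dataframe is illustrative, not the scraped data):

```python
import pandas as pd

# Hypothetical un-combined input: one neighbourhood per row.
raw = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# Join the neighbourhoods that share a postal code and borough with ', '.
combined = (
    raw.groupby(['Postal Code', 'Borough'], sort=False)['Neighbourhood']
       .agg(', '.join)
       .reset_index()
)
print(combined)
```

`sort=False` keeps the groups in their order of first appearance, which matches the row order of the source table.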

Next, check the rule for missing neighbourhoods: if a row has a borough but a 'Not assigned' neighbourhood, the neighbourhood should be set to the borough name.

In [9]:
df[['Neighbourhood']].eq('Not assigned').sum()

Neighbourhood    0
dtype: int64

Good, there is no neighbourhood with a value of 'Not assigned'.
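Had the count been non-zero, the rule could be applied with a boolean mask. A minimal sketch on a made-up two-row frame (the second row is invented to trigger the rule):

```python
import pandas as pd

# Illustrative data: the second row has a borough but no neighbourhood.
sample = pd.DataFrame({
    'Borough': ['North York', "Queen's Park"],
    'Neighbourhood': ['Parkwoods', 'Not assigned'],
})

# Copy the borough name into every 'Not assigned' neighbourhood cell.
mask = sample['Neighbourhood'] == 'Not assigned'
sample.loc[mask, 'Neighbourhood'] = sample.loc[mask, 'Borough']
print(sample)
```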

Finally, use the .shape attribute to confirm the number of rows in the cleaned dataframe.

In [10]:
df.shape

(103, 3)

## 2- Explore and cluster the neighborhoods in Toronto

#### Now, add the latitude and longitude values for the corresponding postal codes.

In [11]:
coor_df = pd.read_csv('Geospatial_Coordinates.csv')
coor_df.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


#### Let's merge the two DataFrames into one.

In [12]:
df = pd.merge(df, coor_df, on="Postal Code")

In [13]:
df.head()

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
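`pd.merge` performs an inner join by default, so a postal code missing from the coordinates file would be dropped without any warning. One way to guard against that, sketched here on a tiny sample where the code `M9Z` is invented purely to demonstrate a miss:

```python
import pandas as pd

# Illustrative inputs; 'M9Z' is a made-up code with no coordinates.
codes = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Hypothetical Borough'],
})
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# A left join keeps every postal code; indicator=True adds a '_merge' column
# and validate raises if either side contains duplicate keys.
checked = pd.merge(codes, coords, on='Postal Code', how='left',
                   validate='one_to_one', indicator=True)
missing = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(missing)
```

An empty `missing` list would confirm that every postal code received coordinates.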


In [14]:
# map rendering library
import folium 

# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim 

# import k-means for the clustering stage
from sklearn.cluster import KMeans

import matplotlib.cm as cm
import matplotlib.colors as colors

In [15]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(df['Borough'].unique()),
        df.shape[0]
    )
)

The dataframe has 10 boroughs and 103 neighborhoods.
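Note that `df.shape[0]` counts rows, and each row is a postal code that may list several neighbourhoods. A rough sketch of counting individual neighbourhood names instead, using three rows copied from the table above (the variable names are illustrative):

```python
import pandas as pd

sample = pd.DataFrame({
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village', 'Regent Park, Harbourfront'],
})

n_boroughs = sample['Borough'].nunique()

# Split the comma-separated names into one entry per neighbourhood, then count.
names = sample['Neighbourhood'].str.split(', ').explode()
n_neighbourhoods = names.nunique()

print(n_boroughs, len(sample), n_neighbourhoods)
```

This assumes the names are always separated by a comma followed by a space, as in the rows shown.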


In [16]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.
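The geocoding call depends on a remote service: `geocode` returns `None` when nothing matches and can raise on timeouts. A possible safeguard is a small wrapper with a fallback. The helper `coordinates_or_default` and the two stand-in functions below are hypothetical and exist only so the sketch runs offline:

```python
from types import SimpleNamespace

# Fallback taken from the coordinates printed above.
TORONTO_FALLBACK = (43.6534817, -79.3839347)

def coordinates_or_default(geocode, address, default):
    """Return (lat, lon) from geocode(address), or default if the lookup fails."""
    try:
        location = geocode(address)
    except Exception:  # deliberately broad in this sketch; narrow it in real code
        location = None
    if location is None:
        return default
    return (location.latitude, location.longitude)

# Offline stand-ins for geolocator.geocode.
def fake_hit(address):
    return SimpleNamespace(latitude=43.65, longitude=-79.38)

def fake_miss(address):
    return None

print(coordinates_or_default(fake_hit, 'Toronto, ON', TORONTO_FALLBACK))
print(coordinates_or_default(fake_miss, 'Toronto, ON', TORONTO_FALLBACK))
```

In the notebook, `geolocator.geocode` would be passed in place of the stand-ins.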


In [17]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto
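The map above plots every borough in the dataframe. To restrict the analysis to boroughs whose names contain 'Toronto', a filter along these lines could be applied before plotting. This is a sketch on three rows from the table above; `toronto_only` is an illustrative name:

```python
import pandas as pd

sample = pd.DataFrame({
    'Borough': ['North York', 'Downtown Toronto', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Regent Park, Harbourfront',
                      "Queen's Park, Ontario Provincial Government"],
})

# Keep only the rows whose borough name contains the substring 'Toronto'.
toronto_only = sample[sample['Borough'].str.contains('Toronto')].reset_index(drop=True)
print(toronto_only)
```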

### To display maps on GitHub, use the HTML class from the IPython library and render the map as shown below.

In [18]:
from IPython.display import HTML

In [19]:
HTML(map_toronto._repr_html_())

## 3- Clustering Neighborhoods in Toronto

In [20]:
k = 4  # number of clusters

# keep only the Latitude and Longitude columns for clustering
toronto_clustering = df.drop(columns=['Postal Code', 'Borough', 'Neighbourhood'])
kmeans = KMeans(n_clusters=k, random_state=0).fit(toronto_clustering)
kmeans.labels_

array([2, 2, 0, 3, 0, 1, 2, 3, 0, 0, 3, 1, 2, 0, 0, 0, 3, 1, 2, 0, 0, 3,
       2, 0, 0, 0, 2, 3, 3, 0, 0, 0, 2, 3, 3, 0, 0, 0, 2, 3, 3, 0, 0, 0,
       2, 3, 1, 0, 0, 1, 1, 2, 3, 1, 0, 3, 1, 1, 2, 3, 1, 3, 3, 1, 1, 2,
       3, 3, 3, 1, 1, 2, 3, 3, 0, 1, 1, 1, 2, 0, 0, 1, 2, 0, 0, 2, 0, 0,
       1, 1, 2, 0, 0, 1, 1, 2, 0, 0, 1, 0, 0, 1, 1], dtype=int32)
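For readers curious about what the clustering step does, here is a minimal NumPy sketch of Lloyd's algorithm, the iteration k-means is usually described by. It is not scikit-learn's implementation (which, among other things, seeds with k-means++ by default); the function `lloyd_kmeans` and the toy points are illustrative assumptions.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm. Returns (labels, centroids, inertia)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (unchanged if it has none).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and within-cluster sum of squared distances.
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    return labels, centroids, float(dists.min(axis=1).sum())

# Two well-separated toy groups of three points each.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids, inertia = lloyd_kmeans(points, k=2)
print(labels, inertia)
```

Because the only features here are latitude and longitude, the clusters are groups of geographically close postal codes, measured with Euclidean distance on degrees.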

In [21]:
df.insert(0, 'Cluster Labels', kmeans.labels_)

In [22]:
df.head()

| | Cluster Labels | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 0 | 2 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | 2 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | 0 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | 0 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
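Before mapping, it can help to summarise each cluster by size and mean position. A small sketch using the five rows shown above (`summary` and its column names are illustrative):

```python
import pandas as pd

sample = pd.DataFrame({
    'Cluster Labels': [2, 2, 0, 3, 0],
    'Borough': ['North York', 'North York', 'Downtown Toronto',
                'North York', 'Downtown Toronto'],
    'Latitude': [43.753259, 43.725882, 43.65426, 43.718518, 43.662301],
    'Longitude': [-79.329656, -79.315572, -79.360636, -79.464763, -79.389494],
})

# One line per cluster: number of rows and the mean coordinates.
summary = sample.groupby('Cluster Labels').agg(
    rows=('Borough', 'size'),
    mean_lat=('Latitude', 'mean'),
    mean_lon=('Longitude', 'mean'),
)
print(summary)
```

On the full dataframe, the mean coordinates should land close to `kmeans.cluster_centers_`, since k-means centroids are the means of their assigned points.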


In [23]:
# create map
map_clusters = folium.Map(location=[latitude, longitude],zoom_start=10)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, colored by cluster label
for lat, lon, neighbourhood, cluster in zip(df['Latitude'], df['Longitude'], df['Neighbourhood'], df['Cluster Labels']):
    label = folium.Popup('{} (Cluster {})'.format(neighbourhood, cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [24]:
HTML(map_clusters._repr_html_())