# Segmentation and Clustering
In this assignment, I will segment and cluster the neighborhoods in the city of Toronto using postal code and borough information.

This is done in three parts:

1. Sourcing data and cleaning it
2. Combining the cleaned data with latitude and longitude coordinates
3. Visualizing the data

## 1. Sourcing Data and Data Wrangling


The neighborhood data is not readily available on the internet. However, a [Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) has all the information we need to explore and cluster the neighborhoods in Toronto.
I will use [Requests](https://realpython.com/python-requests/#the-get-request), the de facto standard HTTP library for Python, to retrieve the Wikipedia page, then read the table into a pandas dataframe and clean it so that the data is in a structured format.

In [1]:
# Import necessary libraries
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import requests # library to send HTTP requests
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

# Uncomment the next line to install geopy
# !conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

# Uncomment the next line to install folium
# !conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


We use the GET method to retrieve the Wikipedia page, then check the status code to confirm that the request succeeded.

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

# Send a single GET request and keep the response for the next cell
toronto_postal_codes = requests.get(url)

# Check whether the request was successful
if toronto_postal_codes.status_code == 200:
    print('Success!')
elif toronto_postal_codes.status_code == 404:
    print('Not Found.')

Success!
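
The check above prints a message only for the status codes 200 and 404. As a minimal sketch, assuming we want a readable message for any status code, the standard library's `http.HTTPStatus` enum can translate the number into its standard phrase. `describe_status` is a hypothetical helper name used for illustration; it is not part of Requests.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return the status code followed by its standard reason phrase."""
    try:
        status = HTTPStatus(status_code)
    except ValueError:
        # Codes outside the standard registry have no known phrase
        return '{}: unknown status'.format(status_code)
    return '{} {}'.format(status.value, status.phrase)


print(describe_status(200))
print(describe_status(404))
```

In this notebook it could be called as `describe_status(toronto_postal_codes.status_code)`.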


Now that we have the data, let's define the column names and read the postal code table into a pandas dataframe.

In [3]:
# Column names expected in the dataframe (for reference; read_html takes them from the table header)
column_names = ["Postal Code",  "Borough", "Neighbourhood"]

# Parse every HTML table in the response; the first one holds the postal codes
toronto_data = pd.read_html(toronto_postal_codes.text, header = 0)
toronto_data = toronto_data[0]

toronto_data.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In this step we deal with redundant information. Neighbourhoods that share a postal code are combined into a single row, with their names joined by commas.

In [4]:
# Combining neighbourhoods with the same postal code
toronto_data = toronto_data.groupby(["Postal Code","Borough"], sort=False).agg(', '.join)
toronto_data.reset_index(inplace=True)

toronto_data.head()


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
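
The first five rows are the same before and after this step, so the effect of the grouping is not visible in the output above. The following sketch shows what the same `groupby` and `join` pattern does on a small made-up table in which one postal code is spread over two rows. The toy rows are an assumption for illustration and are not taken from the Wikipedia page in this form.

```python
import pandas as pd

# Toy data (made up for illustration): M5A appears on two rows
rows = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Group on postal code and borough, then join the neighbourhood names
combined = (
    rows.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
        .agg(", ".join)
        .reset_index()
)

print(combined)
```

Passing `sort=False` keeps the groups in order of first appearance, which is why the row order of the original table is preserved.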


This step deals with missing information. If the Neighbourhood is 'Not assigned', it is replaced by the Borough value. If the Borough itself is 'Not assigned', the row is dropped.

In [5]:
# Replace a missing 'Neighbourhood' value with the 'Borough' value
toronto_data["Neighbourhood"] = np.where(toronto_data["Neighbourhood"] == 'Not assigned',toronto_data["Borough"], toronto_data["Neighbourhood"])

# Get the index labels of all rows with an unassigned borough
not_assigned = toronto_data[toronto_data["Borough"] == 'Not assigned'].index

# Drop the rows where 'Borough' is unassigned and renumber the index
toronto_data.drop(not_assigned, axis= 0, inplace=True)
toronto_data.reset_index(drop = True, inplace = True)

toronto_data.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"
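
To make the two rules easier to follow, here is a small sketch on a made-up table that contains all three cases: a row with nothing assigned, a row with only the borough assigned, and a complete row. The postal codes and names are hypothetical, and the row filter uses a boolean mask rather than `drop`, which gives the same result.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration) covering the three cases
df = pd.DataFrame({
    "Postal Code": ["X1A", "X2A", "X3A"],
    "Borough": ["Not assigned", "Some Borough", "North York"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods"],
})

# Rule 1: fall back to the borough name when the neighbourhood is missing
df["Neighbourhood"] = np.where(
    df["Neighbourhood"] == "Not assigned", df["Borough"], df["Neighbourhood"]
)

# Rule 2: keep only the rows whose borough is assigned
df = df[df["Borough"] != "Not assigned"].reset_index(drop=True)

print(df)
```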


Check the shape of the cleaned dataframe.

In [6]:
toronto_data.shape

(103, 3)
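
The shape alone does not show that the cleaning rules held. A possible complement, sketched here under the assumption that the dataframe has the three columns used in this notebook, is a small set of assertions. `check_clean` is a hypothetical helper name, and the sample rows are copied from the table above.

```python
import pandas as pd


def check_clean(df):
    """Raise AssertionError if the cleaned table breaks an expectation."""
    assert not (df["Borough"] == "Not assigned").any(), "unassigned borough left"
    assert not (df["Neighbourhood"] == "Not assigned").any(), "unassigned neighbourhood left"
    assert df["Postal Code"].is_unique, "duplicate postal code"
    return df.shape


sample = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Victoria Village"],
})

print(check_clean(sample))
```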

## 2. Adding Latitude and Longitude


Now that we have built a dataframe with the postal code, borough name, and neighbourhood names, we need the latitude and longitude coordinates of each postal code area. They are read from a CSV file of geospatial data and then joined to the dataframe on the postal code.

In [7]:
# Read csv file into dataframe
latitude_longitude = pd.read_csv("https://cocl.us/Geospatial_data")
latitude_longitude.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [8]:
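# Make sure the key column is named "Postal Code" so that it matches toronto_data
# (this is a no-op when the CSV already uses that name, as it does here)
latitude_longitude.rename(columns={"Postal Code":"Postal Code"},inplace=True)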
latitude_longitude.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [9]:
# Add the coordinates to Toronto dataframe based on postal code
toronto_data = pd.merge(toronto_data,latitude_longitude,on="Postal Code")
toronto_data.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
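
`pd.merge` performs an inner join by default, so a postal code without a match in the coordinates file would be dropped silently. The sketch below shows one way to surface such rows with the `how`, `indicator`, and `validate` parameters of `pd.merge`. The postal code `M9Z` is made up for illustration; the other values are copied from the tables above.

```python
import pandas as pd

# Toy data: M9Z is a made-up code with no coordinates
areas = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Etobicoke"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# how="left" keeps every area, indicator=True adds a "_merge" column, and
# validate raises MergeError if a postal code is duplicated on either side
merged = pd.merge(areas, coords, on="Postal Code", how="left",
                  indicator=True, validate="one_to_one")

unmatched = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that the inner join in the cell above lost no rows.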


## 3. Visualizing Neighborhoods

Now that we have combined the postal codes with the corresponding coordinates, we can explore the data before moving on to visualizing clusters.
Let's see how many boroughs and neighborhoods there are in Toronto. Each row is one postal code area, which may list several neighbourhood names.

In [10]:
# Count the unique boroughs and the rows (one per postal code area) in the dataframe
print('The dataframe has {} boroughs and {} neighbourhoods.'.format(
        len(toronto_data["Borough"].unique()),
        toronto_data.shape[0]
    )
)

The dataframe has 10 boroughs and 103 neighbourhoods.


In [None]:
# List unique boroughs
toronto_data.Borough.unique()
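
The count printed above treats each row as one neighbourhood, although a row can list several names separated by commas. As a hedged sketch of how the individual names could be counted instead, the joined strings can be split and exploded. The three sample rows are copied from the table above.

```python
import pandas as pd

# Three rows copied from the dataframe shown earlier
sample = pd.DataFrame({
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Regent Park, Harbourfront"],
})

n_boroughs = sample["Borough"].nunique()
n_rows = len(sample)  # one row per postal code area
# Split the joined names and count each neighbourhood name once
n_names = sample["Neighbourhood"].str.split(", ").explode().nunique()

print(n_boroughs, n_rows, n_names)
```

The split assumes that no single neighbourhood name contains a comma followed by a space.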

Let's generate a map of all the neighbourhoods in Toronto. To do that, we first need to obtain the geographical coordinates of Toronto.

In [11]:
address = 'Toronto'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.
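
Geocoding depends on an external service being reachable. An alternative, which is not what this notebook does, is to centre the map on the mean of the coordinates already in the dataframe. The sketch below uses three coordinate pairs copied from the merged table, so the resulting centre is only illustrative.

```python
import pandas as pd

# Three coordinate pairs copied from the merged table above
points = pd.DataFrame({
    "Latitude": [43.753259, 43.725882, 43.654260],
    "Longitude": [-79.329656, -79.315572, -79.360636],
})

# Plain mean of the coordinates, which is adequate for a city-sized area
center = [points["Latitude"].mean(), points["Longitude"].mean()]
print(center)
```

With the full dataframe, the same two `mean()` calls on `toronto_data` would give a centre that could be passed as `location` when creating the map.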


Next, we create the map with markers and labels for the neighbourhoods.

In [12]:
# create map of toronto_data using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighbourhood in zip(toronto_data["Latitude"], toronto_data["Longitude"], toronto_data["Borough"], toronto_data["Neighbourhood"]):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

As we can see, there is a dense group of neighbourhoods clustered together in the south. The neighbourhoods are more spread out to the north, and there seem to be more neighbourhoods along the eastern border.

For simplicity, let's use only the boroughs whose names contain the word 'York' (York, North York, and East York) to visualize the neighbourhoods.

In [14]:
# Slice the dataframe for neighbourhoods related to York
york_data = toronto_data[toronto_data['Borough'].isin(['York', 'North York', 'East York'])].reset_index(drop=True)
york_data.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
3,M3B,North York,Don Mills,43.745906,-79.352188
4,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
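
The cell above lists the three borough names explicitly with `isin`. A substring match is a possible alternative, sketched here with `Series.str.contains` on borough names that appear in this notebook's tables.

```python
import pandas as pd

# Borough names that appear in the tables above
boroughs = pd.DataFrame({
    "Borough": ["North York", "East York", "Downtown Toronto",
                "Scarborough", "Etobicoke"],
})

# Case-sensitive substring match; na=False treats missing values as non-matches
is_york = boroughs["Borough"].str.contains("York", na=False)
print(boroughs[is_york]["Borough"].tolist())
```

The substring match would also select any other borough with 'York' in its name, so the explicit list is the stricter choice when the exact set of boroughs matters.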


Check the shape of the new dataframe to confirm how many neighbourhood rows we will mark on the map.

In [15]:
york_data.shape

(34, 5)
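
To put this row count in context, it can be divided by the 103 rows of the full dataframe. The short sketch below does the arithmetic with the two counts reported by the `shape` outputs in this notebook.

```python
# Row counts taken from the two shape outputs in this notebook
york_rows = 34
toronto_rows = 103

share = york_rows / toronto_rows
print('{:.1%}'.format(share))  # 33.0%
```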

The shape confirms that the three York boroughs account for approximately 33% of the rows in the Toronto dataframe.
Let's explore this further by finding the geographical coordinates of York.

In [16]:
address = 'York, Toronto'

geolocator = Nominatim(user_agent="york_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of York are {}, {}.'.format(latitude, longitude))

The geographical coordinates of York are 43.6896191, -79.479188.


Now we can visualize all the York neighbourhoods on an interactive map.

In [17]:
# create map of york_data using latitude and longitude values
map_york = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighbourhood in zip(york_data["Latitude"], york_data["Longitude"], york_data["Borough"], york_data["Neighbourhood"]):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_york)  
    
map_york

Compared with the map of the whole of Toronto, we can see that the York boroughs take up most of northern Toronto. These neighbourhoods are farther apart from each other than those in central Toronto, and a number of them lie around the airport.
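
Both map cells repeat the same marker loop. One way to share that logic, sketched here under the assumption that the dataframe has the `Borough`, `Neighbourhood`, `Latitude`, and `Longitude` columns used in this notebook, is a small generator that yields the position and label of each marker. `marker_rows` is a hypothetical helper name, and the sample row is copied from the table above.

```python
import pandas as pd


def marker_rows(df):
    """Yield (latitude, longitude, label) for each row of a neighbourhood table."""
    for row in df.itertuples(index=False):
        label = '{}, {}'.format(row.Neighbourhood, row.Borough)
        yield row.Latitude, row.Longitude, label


sample = pd.DataFrame({
    "Borough": ["North York"],
    "Neighbourhood": ["Parkwoods"],
    "Latitude": [43.753259],
    "Longitude": [-79.329656],
})

print(list(marker_rows(sample)))
```

Each yielded tuple could then feed the `folium.CircleMarker([lat, lng], popup=label, ...)` call used in the map cells, for either `toronto_data` or `york_data`.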