# Segmenting and Clustering Neighborhoods in Toronto

## Part 1
### Installing dependencies

In [68]:
import sys
!conda install --quiet --yes --prefix {sys.prefix} numpy
!conda install --quiet --yes --prefix {sys.prefix} pandas
!conda install --quiet --yes --prefix {sys.prefix} lxml
!conda install --quiet --yes --prefix {sys.prefix} requests
!conda install --quiet --yes --prefix {sys.prefix} -c conda-forge geopy
!conda install --quiet --yes --prefix {sys.prefix} -c conda-forge folium=0.5.0
!conda install --yes --prefix {sys.prefix} -c conda-forge scikit-learn

print('Dependencies installed')

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: /home/ivan/anaconda3/envs/Coursera_Capstone

  added / updated specs:
    - numpy


The following packages will be UPDATED:

  ca-certificates    conda-forge::ca-certificates-2019.11.~ --> pkgs/main::ca-certificates-2020.1.1-0

The following packages will be SUPERSEDED by a higher-priority channel:

  certifi            conda-forge::certifi-2019.11.28-py37h~ --> pkgs/main::certifi-2019.11.28-py37_1
  openssl            conda-forge::openssl-1.1.1f-h516909a_0 --> pkgs/main::openssl-1.1.1f-h7b6447c_0


Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Collecting package metada

### Importing libraries

In [9]:
import numpy as np
import pandas as pd
import folium
import requests
from geopy.geocoders import Nominatim
try:
    from pandas import json_normalize  # pandas >= 1.0
except ImportError:
    from pandas.io.json import json_normalize  # older pandas releases
from sklearn.cluster import KMeans

print('Libraries imported')

Libraries imported


### Reading HTML with pandas

In [10]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df = pd.read_html(url)[0]
print(df.shape)
df.head()

(180, 3)


Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


### Keep only the rows that have an assigned borough
Drop every row whose borough is Not assigned.

In [11]:
# Filter out unassigned boroughs and rebuild a clean 0..n-1 index
df_cleared = df[df['Borough'] != 'Not assigned'].reset_index(drop=True)
print(df_cleared.shape)
df_cleared.head()

(103, 3)


Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government


### A row may have a borough but a Not assigned neighborhood
In that case the neighborhood should be set to the borough name.
Let's check whether any such rows exist.

In [12]:
df_cleared[df_cleared['Neighborhood'] == 'Not assigned']

Unnamed: 0,Postal code,Borough,Neighborhood
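
The check above only looks for the literal string Not assigned. In the scraped table the unassigned neighborhoods seem to show up as empty cells, so a more defensive version might also test for missing values. Here is a minimal sketch of the fill rule on a small hypothetical sample (the `sample` dataframe and its rows are made up for illustration, they are not taken from the Wikipedia table):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same columns as df_cleared
sample = pd.DataFrame({
    'Postal code': ['M3A', 'M7A', 'M9A'],
    'Borough': ['North York', "Queen's Park", 'Etobicoke'],
    'Neighborhood': ['Parkwoods', 'Not assigned', np.nan],
})

# Treat both the literal string and missing values as "not assigned"
missing = sample['Neighborhood'].isna() | (sample['Neighborhood'] == 'Not assigned')

# Copy the borough name into the neighborhood column for those rows only
sample.loc[missing, 'Neighborhood'] = sample.loc[missing, 'Borough']
print(sample)
```

Rows that already have a neighborhood are left untouched, because the assignment goes through the boolean mask.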


### More than one neighborhood can exist in one postal code area
Such rows should be combined into a single row, with the neighborhoods separated by a comma.

Let's check whether any postal code appears in more than one row.

In [13]:
# True when every postal code appears exactly once
print(df_cleared['Postal code'].is_unique)

True


In [14]:
df_cleared['Postal code'].value_counts()

M6S    1
M7R    1
M9P    1
M8Y    1
M4S    1
      ..
M5H    1
M4V    1
M4B    1
M1S    1
M5B    1
Name: Postal code, Length: 103, dtype: int64
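
Every postal code appears once here, so nothing has to be combined. If the source table did list one neighborhood per row, the combining step could look roughly like the sketch below. The `sample` dataframe is hypothetical and only mimics the column layout of `df_cleared`:

```python
import pandas as pd

# Hypothetical sample where postal code M5A is split over two rows
sample = pd.DataFrame({
    'Postal code': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# Group by postal code and borough, then join the neighborhood names with a comma
combined = (
    sample.groupby(['Postal code', 'Borough'], as_index=False, sort=False)['Neighborhood']
    .agg(', '.join)
)
print(combined)
```

`sort=False` keeps the groups in the order they first appear, so the result can still be sorted by postal code afterwards.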

In [15]:
# Sort by postal code and rebuild the index
df_cleared = df_cleared.sort_values(by='Postal code').reset_index(drop=True)
df_cleared.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1B,Scarborough,Malvern / Rouge
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek
2,M1E,Scarborough,Guildwood / Morningside / West Hill
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [16]:
df_cleared.shape

(103, 3)

<hr>

## Part 2
### Getting Latitude and Longitude
Now that we have a dataframe with the postal code, borough name, and neighborhood name of each area, we need the latitude and longitude of each postal code in order to use the Foursquare location data.

Let's read the coordinates from a CSV file.

In [17]:
df_coors = pd.read_csv('Geospatial_Coordinates.csv')

Let's sort the dataframe so that its rows match the order of the main dataframe (df_cleared).

In [18]:
# Sort by postal code and rebuild the index
df_coors = df_coors.sort_values(by='Postal Code').reset_index(drop=True)
df_coors.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [19]:
df_coors.shape

(103, 3)

Dataframes to be used:

1. df_cleared
2. df_coors

Drop the postal code column of the second dataframe.

In [20]:
df_lat_lng = df_coors.drop(columns=['Postal Code'])
df_lat_lng.head()

Unnamed: 0,Latitude,Longitude
0,43.806686,-79.194353
1,43.784535,-79.160497
2,43.763573,-79.188711
3,43.770992,-79.216917
4,43.773136,-79.239476


In [21]:
# Column-wise concat aligns rows by index, so both dataframes must be sorted by postal code first
df_canada = pd.concat([df_cleared, df_lat_lng], axis=1)
df_canada.head()

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,Malvern / Rouge,43.806686,-79.194353
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek,43.784535,-79.160497
2,M1E,Scarborough,Guildwood / Morningside / West Hill,43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
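
The concatenation above works because both dataframes were sorted by postal code and have the same length. A possible alternative that does not depend on row order is to join on the postal code itself. The sketch below uses two tiny hand-built dataframes (`left` and `right` are hypothetical names) with the same column names as df_cleared and df_coors, deliberately in different row order:

```python
import pandas as pd

# Same column names as df_cleared / df_coors, rows intentionally out of order
left = pd.DataFrame({
    'Postal code': ['M1C', 'M1B'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge Hill / Port Union / Highland Creek', 'Malvern / Rouge'],
})
right = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# Join on the postal code; validate raises if a code is duplicated on either side
merged = (
    left.merge(right, left_on='Postal code', right_on='Postal Code',
               how='left', validate='one_to_one')
    .drop(columns=['Postal Code'])
)
print(merged)
```

With `how='left'`, a postal code that has no match in the coordinates file ends up with missing latitude and longitude instead of silently receiving another row's values.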


<hr>

## Part 3
### Cluster Neighborhoods
Explore and cluster the neighborhoods in Toronto.
One option is to keep only the boroughs whose name contains the word Toronto and then repeat the analysis done on the New York City data. This restriction is optional.
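
If we did want to restrict the analysis to boroughs containing the word Toronto, a string filter on the Borough column would probably be enough. This is only a sketch on a small sample dataframe (`sample` and `toronto_only` are illustrative names); the notebook itself keeps all boroughs:

```python
import pandas as pd

# Small sample with the same Borough / Neighborhood columns as df_canada
sample = pd.DataFrame({
    'Borough': ['Scarborough', 'Downtown Toronto', 'North York', 'Downtown Toronto'],
    'Neighborhood': ['Woburn', 'Regent Park / Harbourfront', 'Parkwoods',
                     "Queen's Park / Ontario Provincial Government"],
})

# Keep rows whose borough name contains the plain substring 'Toronto'
is_toronto = sample['Borough'].str.contains('Toronto', regex=False)
toronto_only = sample[is_toronto].reset_index(drop=True)
print(toronto_only)
```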

Set number of clusters

In [22]:
df_canada.head()

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,Malvern / Rouge,43.806686,-79.194353
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek,43.784535,-79.160497
2,M1E,Scarborough,Guildwood / Morningside / West Hill,43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


Let's plot the neighborhoods to explore them.

In [23]:
# Approximate coordinates of Toronto, used to center the map
lat_canada = 43.7001100
lng_canada = -79.4163000

In [24]:
map_canada = folium.Map(location=[lat_canada, lng_canada], zoom_start=10)

# Add one circle marker per neighborhood
for lat, lng, borough, neighborhood in zip(df_canada['Latitude'], df_canada['Longitude'], df_canada['Borough'], df_canada['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_canada)

map_canada

Let's assign a color for each borough

In [25]:
# Count the rows (postal codes) in each borough
df_canada['Borough'].value_counts()

North York          24
Downtown Toronto    19
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
York                 5
East York            5
East Toronto         5
Mississauga          1
Name: Borough, dtype: int64

In [26]:
# Assign a color
borough_colors = {
    'North York': '#EA2027',
    'Downtown Toronto': '#006266',
    'Scarborough': '#1B1464',
    'Etobicoke': '#5758BB',
    'Central Toronto': '#6F1E51',
    'West Toronto': '#EE5A24',
    'York': '#009432',
    'East York': '#0652DD',
    'East Toronto': '#9980FA',
    'Mississauga': '#833471',
}
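
The dictionary above is written by hand. As a rough alternative, the mapping could be built from the data so that it stays in sync if the list of boroughs changes. The sketch below is an assumption about how one might do that, not part of the original analysis; `palette`, `fallback`, and the sample `boroughs` series are hypothetical:

```python
from itertools import cycle

import pandas as pd

# Hypothetical sample of the Borough column
boroughs = pd.Series(['North York', 'Downtown Toronto', 'North York', 'Scarborough'])

# Any list of hex colors works; it is cycled if there are more boroughs than colors
palette = ['#EA2027', '#006266', '#1B1464']

# unique() keeps the order of first appearance, so the mapping is deterministic
color_map = dict(zip(boroughs.unique(), cycle(palette)))

# Use a neutral fallback for a borough that is missing from the mapping
fallback = '#888888'
print(color_map)
print(color_map.get('Mississauga', fallback))
```

Looking colors up with `.get()` and a fallback avoids a KeyError while plotting if a borough was left out of the mapping.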

In [28]:
# Plot the neighborhoods, colored by borough
map_canada = folium.Map(location=[lat_canada, lng_canada], zoom_start=10)

# Add one circle marker per neighborhood
for lat, lng, borough, neighborhood in zip(df_canada['Latitude'], df_canada['Longitude'], df_canada['Borough'], df_canada['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=borough_colors[borough],
        fill=True,
        fill_color=borough_colors[borough],
        fill_opacity=0.7,
        parse_html=False).add_to(map_canada)

map_canada
