

<br><br>

<h1><center>Applied Data Science Capstone Course</center></h1>

## Week 3 Assignment: 'Segmenting and Clustering Neighborhoods in Toronto'

## Author: Diego Medeiros

### GitHub: https://github.com/medeirox


## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Data Scraping of Toronto's Neighborhood Data</a>

2. <a href="#item2">Collecting Longitude and Latitude from GeoPy</a>

3. <a href="#item3">Analyze Each Neighborhood</a>

</font>
</div>

<a id='item1'></a>

## Data Scraping of Toronto's Neighborhood Data

First, we collect the postal code table from Wikipedia (https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) and load it into a pandas DataFrame.

In [1]:
import pandas as pd

In [2]:
d = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

In [3]:
df = d[0]
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


### Assign the Borough to Neighbourhoods containing "Not assigned" and drop rows with a "Not assigned" Borough

In [4]:
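# Where the Neighbourhood is "Not assigned", reuse the Borough name
mask = df['Neighbourhood'].str.contains('Not assigned')
df.loc[mask, 'Neighbourhood'] = df.loc[mask, 'Borough']
# Drop rows whose Borough is "Not assigned"
df = df[df['Borough'] != 'Not assigned']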
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern
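
The fill-then-drop step can be tried on a small hand-made frame. The following is a minimal sketch, assuming three toy rows modeled on the table above; the names `toy` and `mask` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy rows modeled on the scraped table (illustrative data only)
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Where the neighbourhood is missing, reuse the borough name
mask = toy['Neighbourhood'] == 'Not assigned'
toy.loc[mask, 'Neighbourhood'] = toy.loc[mask, 'Borough']

# Drop rows whose borough is still unassigned
toy = toy[toy['Borough'] != 'Not assigned'].reset_index(drop=True)
print(toy)
```

Filling before dropping matters only for readability here: rows with an unassigned borough are removed either way, while rows such as M7A keep their borough name as the neighbourhood.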


### Process the cells that have an assigned borough. Ignore cells with a borough that is "Not assigned".

In [5]:
df_clean = df[df['Borough'] != 'Not assigned']
df_clean.columns = ['PostalCode', 'Borough', 'Neighbourhood']
df_clean.head(10)

Unnamed: 0,PostalCode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


### Aggregate by PostalCode and Borough, joining all Neighbourhoods that share a postal code into a single row

In [6]:
pc_groups = df_clean.groupby(['PostalCode', 'Borough']).agg(lambda x: ', '.join(x))
pc_groups.reset_index(inplace=True)
pc_groups.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
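
The same aggregation can be checked on a few rows. This is a small sketch, assuming toy data copied from the tables above; `toy` and `grouped` are illustrative names, and selecting the `Neighbourhood` column before aggregating is a stylistic choice rather than the notebook's exact call.

```python
import pandas as pd

# Two neighbourhoods share postal code M1B, one belongs to M1G
toy = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge', 'Malvern', 'Woburn'],
})

# Join the neighbourhood names of each (PostalCode, Borough) group
grouped = (
    toy.groupby(['PostalCode', 'Borough'])['Neighbourhood']
       .agg(', '.join)
       .reset_index()
)
print(grouped)
```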


### Checking DataFrame information

In [7]:
pc_groups.shape

(103, 3)

In [8]:
pc_groups.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 3 columns):
PostalCode       103 non-null object
Borough          103 non-null object
Neighbourhood    103 non-null object
dtypes: object(3)
memory usage: 2.5+ KB


In [9]:
pc_groups.describe()

Unnamed: 0,PostalCode,Borough,Neighbourhood
count,103,103,103
unique,103,11,103
top,M4G,North York,Northwest
freq,1,24,1


<a id='item2'></a>

## Collecting Longitude and Latitude from GeoPy

In [10]:
#!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe
#!conda install -c conda-forge folium --yes
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


### Collect Latitude and Longitude for each postal code (course CSV used in place of Nominatim)

In [11]:
# Due to problems using the GeoPy package, I'll use the csv provided by the course
!wget -q -O 'toronto_postalcodes.csv' https://cocl.us/Geospatial_data

In [12]:
df_postalcodes = pd.read_csv('toronto_postalcodes.csv')
df_postalcodes.set_index('Postal Code', inplace=True)
df_postalcodes.head()

Unnamed: 0_level_0,Latitude,Longitude
Postal Code,Unnamed: 1_level_1,Unnamed: 2_level_1
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476


In [13]:
# Sanity check: .loc raises a KeyError if a postal code is missing from the CSV
for pc in pc_groups['PostalCode']:
    df_postalcodes.loc[pc]
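
A lookup loop like this stops at the first missing key. As a hedged alternative, the sketch below lists every missing postal code at once with `isin`; the frame `coords`, the series `wanted`, and the code `M9Z` are made up for illustration.

```python
import pandas as pd

# Toy lookup table indexed by postal code (values taken from the CSV preview)
coords = pd.DataFrame(
    {'Latitude': [43.806686, 43.784535], 'Longitude': [-79.194353, -79.160497]},
    index=['M1B', 'M1C'],
)

# 'M9Z' is a made-up code that is absent from the lookup table
wanted = pd.Series(['M1B', 'M1C', 'M9Z'])

# Keep only the codes that do not appear in the lookup index
missing = wanted[~wanted.isin(coords.index)]
print(missing.tolist())
```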

In [14]:
# The geolocator failed to return coordinates for every postal code,
# so the course-provided CSV is used instead.
#geolocator = Nominatim(user_agent="toronto_explorer")

# Look up the coordinates of each postal code, in the same order as pc_groups
df_latlon = df_postalcodes.loc[pc_groups['PostalCode'], ['Latitude', 'Longitude']].rename_axis(None)
df_latlon.head()

Unnamed: 0,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476


In [15]:
print('Shape of df_latlon: {}'.format(df_latlon.shape))
print('Shape of pc_groups: {}'.format(pc_groups.shape))

Shape of df_latlon: (103, 2)
Shape of pc_groups: (103, 3)


In [16]:
df_latlon.reset_index(inplace=True)
pc_groups = pd.concat([pc_groups, df_latlon], axis=1)

In [17]:
pc_groups.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,index,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",M1B,43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",M1C,43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",M1E,43.763573,-79.188711
3,M1G,Scarborough,Woburn,M1G,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,M1H,43.773136,-79.239476
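
Concatenating side by side works here because both frames list the postal codes in the same order. A key-based merge does not depend on row order and avoids the extra `index` column. The following is a minimal sketch on toy data, assuming the column names shown above; `groups`, `coords`, and `merged` are illustrative names.

```python
import pandas as pd

groups = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
})
coords = pd.DataFrame({
    'Postal Code': ['M1C', 'M1B'],  # deliberately in a different order
    'Latitude': [43.784535, 43.806686],
    'Longitude': [-79.160497, -79.194353],
})

# Match rows on the postal code instead of on row position
merged = groups.merge(coords, left_on='PostalCode', right_on='Postal Code', how='left')

# The right-hand key duplicates PostalCode, so drop it
merged = merged.drop(columns='Postal Code')
print(merged)
```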


<a id='item3'></a>

## Analyze Each Neighborhood

In [18]:
import folium

Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.

In [24]:
df_explore = pc_groups[pc_groups['Borough'].str.contains('Toronto')]
df_explore.head(10)

Unnamed: 0,PostalCode,Borough,Neighbourhood,index,Latitude,Longitude
37,M4E,East Toronto,The Beaches,M4E,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West, Riverdale",M4K,43.679557,-79.352188
42,M4L,East Toronto,"The Beaches West, India Bazaar",M4L,43.668999,-79.315572
43,M4M,East Toronto,Studio District,M4M,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,M4N,43.72802,-79.38879
45,M4P,Central Toronto,Davisville North,M4P,43.712751,-79.390197
46,M4R,Central Toronto,North Toronto West,M4R,43.715383,-79.405678
47,M4S,Central Toronto,Davisville,M4S,43.704324,-79.38879
48,M4T,Central Toronto,"Moore Park, Summerhill East",M4T,43.689574,-79.38316
49,M4V,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",M4V,43.686412,-79.400049


In [25]:
df_explore.shape

(38, 6)

There are 38 postal codes whose Borough contains the word 'Toronto'.

In [26]:
df_explore.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38 entries, 37 to 87
Data columns (total 6 columns):
PostalCode       38 non-null object
Borough          38 non-null object
Neighbourhood    38 non-null object
index            38 non-null object
Latitude         38 non-null float64
Longitude        38 non-null float64
dtypes: float64(2), object(4)
memory usage: 2.1+ KB


In [27]:
df_explore.describe()

Unnamed: 0,Latitude,Longitude
count,38.0,38.0
mean,43.667262,-79.389883
std,0.02378,0.037954
min,43.628947,-79.48445
25%,43.649363,-79.405678
50%,43.662152,-79.385975
75%,43.678757,-79.375946
max,43.72802,-79.293031


Here the map is instantiated, centered on the coordinates that Nominatim returns for 'Toronto, ON'.

In [31]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

toronto_map = folium.Map(location=[location.latitude, location.longitude], zoom_start=11)
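
The geocoding call needs network access, and `geocode` can return `None` when no match is found. As a fallback, the map center can be approximated by the mean of the postal code coordinates. This is a sketch under that assumption, using three rows from the table above; `points` and `center` are illustrative names.

```python
import pandas as pd

# Three East Toronto postal codes from the table above
points = pd.DataFrame({
    'Latitude': [43.676357, 43.679557, 43.668999],
    'Longitude': [-79.293031, -79.352188, -79.315572],
})

# Average the coordinates to get an approximate center
center = [points['Latitude'].mean(), points['Longitude'].mean()]
print(center)

# The result could then be passed to folium, for example:
# folium.Map(location=center, zoom_start=11)
```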


Now all the Postal Codes are plotted on the map, each labeled with its Borough's name.

In [30]:
from folium import plugins

# start again with a clean copy of the map of Toronto
toronto_map = folium.Map(location = [location.latitude, location.longitude], zoom_start = 12)

# instantiate a marker cluster object for the postal codes in the dataframe
postal_codes = plugins.MarkerCluster().add_to(toronto_map)

# loop through the dataframe and add each data point to the marker cluster
for lat, lng, label, bor in zip(df_explore['Latitude'], df_explore['Longitude'], df_explore['PostalCode'], df_explore['Borough']):
    folium.Marker(
        location=[lat, lng],
        icon=None,
        popup=bor + ' - ' + label,
    ).add_to(postal_codes)

# display map
toronto_map