# Segmenting and Clustering Neighborhoods in Toronto

## 1. Scraping
Use the notebook to scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extract the data in the table of postal codes, and transform it into a pandas dataframe.


We first install the Beautiful Soup package.

In [1]:
# install Beautiful Soup package and import it
! pip install bs4
from bs4 import BeautifulSoup

Collecting bs4
  Downloading https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Stored in directory: /home/jupyterlab/.cache/pip/wheels/a0/b0/b2/4f80b9456b87abedbc0bf2d52235414c3467d8889be38dd472
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1


To parse the document, we will pass it into the BeautifulSoup constructor.

In [2]:
!wget --quiet https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M -O CodesCanada

with open("CodesCanada") as fp:
    soup = BeautifulSoup(fp, "html.parser")  # name the parser explicitly to avoid the "no parser specified" warning

Then, we create a pandas dataframe that will contain the postal codes.

In [3]:
# import pandas libraries
import pandas as pd

# define the column of our dataframe
column_names=['Postalcode','Borough','Neighbourhood']

In [4]:
# collect all table rows once instead of re-parsing the page on every iteration
rows = soup.find_all('tr')

# number of rows to read: skip the header row (index 0) and the last 5 rows
n_max = len(rows) - 5

# collect the Postalcode, the Borough and the Neighbourhood of each row
records = []
for i in range(1, n_max):
    cells = rows[i].contents
    records.append({'Postalcode': cells[1].text, 'Borough': cells[3].text, 'Neighbourhood': cells[5].text})

# create the dataframe in one step (DataFrame.append is deprecated and slow in a loop)
df_CanadaPostCodes_raw = pd.DataFrame(records, columns=column_names)

# check data
df_CanadaPostCodes_raw.head(10)

Unnamed: 0,Postalcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned\n
1,M2A,Not assigned,Not assigned\n
2,M3A,North York,Parkwoods\n
3,M4A,North York,Victoria Village\n
4,M5A,Downtown Toronto,Harbourfront\n
5,M5A,Downtown Toronto,Regent Park\n
6,M6A,North York,Lawrence Heights\n
7,M6A,North York,Lawrence Manor\n
8,M7A,Queen's Park,Not assigned\n
9,M8A,Not assigned,Not assigned\n
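
The row-by-row extraction above can be tried on a small, self-contained example. This is only a sketch: the HTML string below is a hypothetical miniature of the Wikipedia table, and selecting `td` cells by tag (rather than by position in `.contents`) is an alternative to the approach used in this notebook, not the notebook's own method.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical miniature of the postal code table, for illustration only
html = """
<table>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

soup_demo = BeautifulSoup(html, "html.parser")

records = []
for row in soup_demo.find_all("tr"):
    cells = row.find_all("td")  # the header row only has <th> cells, so it yields none
    if len(cells) == 3:
        records.append([cell.get_text() for cell in cells])

df_demo = pd.DataFrame(records, columns=["Postalcode", "Borough", "Neighbourhood"])
print(df_demo)
```

As in the real table, the last cell of each row keeps its trailing newline, which is why a cleaning step is needed afterwards.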


## 2. Clean the dataset
1. Only process the cells that have an assigned borough.
2. Remove the \n at the end of Neighbourhood names.
3. If Neighbourhood has the value 'Not assigned', replace it with the "Borough" value.
4. Aggregate the rows by Postalcode.

In [22]:
# 1) Drop the rows that have no assigned borough.
df_CanadaPostCodes = df_CanadaPostCodes_raw[df_CanadaPostCodes_raw['Borough'] != 'Not assigned'].reset_index(drop=True)

# 2) Remove the \n at the end of Neighbourhood names
df_CanadaPostCodes['Neighbourhood'] = df_CanadaPostCodes['Neighbourhood'].str.replace('\n', '', regex=False)

# 3) If Neighbourhood has the value 'Not assigned', replace it with the "Borough" value
#    (.loc with a boolean mask avoids the chained-assignment warning of the element-wise loop)
not_assigned = df_CanadaPostCodes['Neighbourhood'] == 'Not assigned'
df_CanadaPostCodes.loc[not_assigned, 'Neighbourhood'] = df_CanadaPostCodes.loc[not_assigned, 'Borough']
df_CanadaPostCodes.head(15)

Unnamed: 0,Postalcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern


More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma.
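
The combination described above can be sketched on a toy dataframe. The three rows below are copied from the cleaned table; the variable names `toy` and `combined` are introduced only for this illustration.

```python
import pandas as pd

# Three rows taken from the cleaned table: M5A appears twice
toy = pd.DataFrame({
    "Postalcode": ["M4A", "M5A", "M5A"],
    "Borough": ["North York", "Downtown Toronto", "Downtown Toronto"],
    "Neighbourhood": ["Victoria Village", "Harbourfront", "Regent Park"],
})

# One row per postal code: keep the first borough, join the neighbourhood names
combined = toy.groupby("Postalcode", as_index=False).agg(
    {"Borough": "first", "Neighbourhood": " , ".join}
)
print(combined)
```

Passing `as_index=False` keeps `Postalcode` as a regular column, so no index reset is needed afterwards.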

In [41]:
# 4) Aggregate by Postalcode: keep the first Borough and join the Neighbourhood names
df_CanadaPostCodes_agg = df_CanadaPostCodes.groupby('Postalcode', as_index=False).agg({'Borough': 'first', 'Neighbourhood': lambda a: " , ".join(a)})
df_CanadaPostCodes_agg

Unnamed: 0,Postalcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge , Malvern"
1,M1C,Scarborough,"Highland Creek , Rouge Hill , Port Union"
2,M1E,Scarborough,"Guildwood , Morningside , West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park , Ionview , Kennedy Park"
7,M1L,Scarborough,"Clairlea , Golden Mile , Oakridge"
8,M1M,Scarborough,"Cliffcrest , Cliffside , Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff , Cliffside West"


In [42]:
df_CanadaPostCodes_agg.shape

(103, 3)

## 3. Map Postal Codes to Latitude and Longitude

1. Import data

In [35]:
!wget -O PostalCode.csv https://cocl.us/Geospatial_data

--2019-02-18 18:50:56--  https://cocl.us/Geospatial_data
Resolving cocl.us (cocl.us)... 169.48.113.201
Connecting to cocl.us (cocl.us)|169.48.113.201|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-02-18 18:50:59--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 107.152.27.197
Connecting to ibm.box.com (ibm.box.com)|107.152.27.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-02-18 18:50:59--  https://ibm.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Reusing existing connection to ibm.box.com:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.ent.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-02-18 

In [36]:
# assign the data to a pandas dataframe
df_PostalCodeCoord = pd.read_csv("PostalCode.csv")

# take a look at the dataset
df_PostalCodeCoord.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


2. Merge with the postal code dataframe

In [43]:
#rename column "Postal Code" in dataframe df_PostalCodeCoord
df_PostalCodeCoord.rename(columns={'Postal Code':'Postalcode'}, inplace=True)

#merge based on Postalcode column
df_CanadaPostCodes_agg=pd.merge(df_CanadaPostCodes_agg,df_PostalCodeCoord, on='Postalcode')

#check merge
df_CanadaPostCodes_agg

Unnamed: 0,Postalcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge , Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek , Rouge Hill , Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood , Morningside , West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park , Ionview , Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea , Golden Mile , Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest , Cliffside , Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff , Cliffside West",43.692657,-79.264848
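
An inner merge silently drops postal codes that have no coordinates. A minimal sketch of how such gaps could be detected, assuming toy data (the code `M9Z` and its borough are hypothetical) and using the `validate` and `indicator` parameters of `pd.merge`:

```python
import pandas as pd

# Toy inputs; M9Z is a made-up code with no matching coordinates
codes = pd.DataFrame({
    "Postalcode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})
coords = pd.DataFrame({
    "Postalcode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# A left merge keeps every postal code; validate raises if a key is duplicated,
# and indicator adds a _merge column telling where each row came from
checked = pd.merge(codes, coords, on="Postalcode", how="left",
                   validate="one_to_one", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "Postalcode"].tolist()
print(missing)
```

An empty `missing` list would indicate that every postal code received coordinates.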


## 4. Explore Toronto
The aim of this section is to show on a map: my position (green dot), the restaurants whose distance is lower than the mean distance of all restaurants within a radius of 5 km (red dots), and the restaurants whose distance is higher than that mean (blue dots).

First, we import the libraries.

In [66]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# function to transform a JSON structure into a pandas dataframe
from pandas import json_normalize # pandas >= 1.0; older versions expose it as pandas.io.json.json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

print('Folium installed')
print('Libraries imported.')

Collecting package metadata: done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda

  added / updated specs:
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    conda-4.6.4                |           py36_0         877 KB  conda-forge
    geographiclib-1.49         |             py_0          32 KB  conda-forge
    geopy-1.18.1               |             py_0          51 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         961 KB

The following NEW packages will be INSTALLED:

  geographiclib      conda-forge/noarch::geographiclib-1.49-py_0

The following packages will be UPDATED:

  conda                                        4.6.3-py36_0 --> 4.6.4-py36_0
  geopy              conda-forge/linux-64::geopy-1.11.0-py~ --> conda-forge/noarch::geopy-1.

Set up query parameters 

In [75]:
CLIENT_ID = '1SEG22KQUVDW0BIUOJ10GSAM1EOG1DBBFRZYWQCNUSKMTZVP' # your Foursquare ID
CLIENT_SECRET = 'ZDVVBSJMWWEWZAGPIDK4ANIVP2O0QGGI0JPGDV0CALUREZ2N' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 100
search_query = 'restaurant'

latitude = 43.806686
longitude = -79.194353
radius=5000

url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
url

'https://api.foursquare.com/v2/venues/search?client_id=1SEG22KQUVDW0BIUOJ10GSAM1EOG1DBBFRZYWQCNUSKMTZVP&client_secret=ZDVVBSJMWWEWZAGPIDK4ANIVP2O0QGGI0JPGDV0CALUREZ2N&ll=43.806686,-79.194353&v=20180604&query=restaurant&radius=5000&limit=100'
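
The same query string can also be assembled from a dictionary of parameters, which keeps the credentials out of the notebook source. This is a sketch under assumptions: the environment variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are hypothetical, and the standard-library `urlencode` is swapped in for the `str.format` call used above.

```python
import os
from urllib.parse import urlencode

# Hypothetical environment variable names; placeholders are used when they are unset
params = {
    "client_id": os.environ.get("FOURSQUARE_CLIENT_ID", "YOUR_CLIENT_ID"),
    "client_secret": os.environ.get("FOURSQUARE_CLIENT_SECRET", "YOUR_CLIENT_SECRET"),
    "ll": "43.806686,-79.194353",  # latitude,longitude of the search centre
    "v": "20180604",
    "query": "restaurant",
    "radius": 5000,                # metres
    "limit": 100,
}

demo_url = "https://api.foursquare.com/v2/venues/search?" + urlencode(params)
print(demo_url)
```

`urlencode` percent-encodes the comma in `ll` as `%2C`, so the printed URL differs slightly from the one above while carrying the same parameters.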

Send GET request and examine results

In [105]:
import requests
results = requests.get(url).json()
venues=results['response']['venues']

In [106]:
# transform venues into a dataframe
dataframe = json_normalize(venues)
dataframe.head()

Unnamed: 0,categories,hasPerk,id,location.address,location.cc,location.city,location.country,location.crossStreet,location.distance,location.formattedAddress,location.labeledLatLngs,location.lat,location.lng,location.neighborhood,location.postalCode,location.state,name,referralId,venuePage.id
0,"[{'id': '4bf58dd8d48988d143941735', 'name': 'B...",False,4be6c179d4f7c9b665042720,404 Old Kingston Rd.,CA,Scarborough,Canada,,3194,"[404 Old Kingston Rd., Scarborough ON, Canada]","[{'label': 'display', 'lat': 43.78446796744621...",43.784468,-79.1692,,,ON,Ted's Restaurant,v-1550518728,
1,"[{'id': '4bf58dd8d48988d1f9941735', 'name': 'F...",False,4c97a82382b56dcbf7afebaa,,CA,Toronto Division,Canada,,1636,"[Toronto Division ON, Canada]","[{'label': 'display', 'lat': 43.81958724358436...",43.819587,-79.184574,,,ON,Africa Restaurant,v-1550518728,
2,"[{'id': '4bf58dd8d48988d1c4941735', 'name': 'R...",False,4b9eb453f964a5201bfb36e3,2818 Markham Rd,CA,Scarborough,Canada,,4635,"[2818 Markham Rd, Scarborough ON M1X 1E6, Canada]","[{'label': 'display', 'lat': 43.8229066, 'lng'...",43.822907,-79.247505,,M1X 1E6,ON,Nirala Sweets & Restaurant,v-1550518728,
3,"[{'id': '52e81612bcbc57f1066b79f1', 'name': 'B...",False,583f175dfbe8ff549afaa9fd,,CA,Toronto,Canada,,3820,"[Toronto ON M1G, Canada]","[{'label': 'display', 'lat': 43.7844523004027,...",43.784452,-79.230578,,M1G,ON,The Local Cafe Restaurant,v-1550518728,
4,"[{'id': '4bf58dd8d48988d145941735', 'name': 'C...",False,4b5738d3f964a520c62b28e3,4810 Sheppard Ave. E,CA,Scarborough,Canada,at Shorting Rd.,4782,"[4810 Sheppard Ave. E (at Shorting Rd.), Scarb...","[{'label': 'display', 'lat': 43.791806, 'lng':...",43.791806,-79.250197,,M1S 4N6,ON,Best Friends Chinese Restaurant 會賓樓,v-1550518728,
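
The dotted column names in this output come from flattening nested JSON. A minimal sketch of that behaviour, assuming pandas 1.0 or later (where the function is available as `pd.json_normalize`) and using two made-up, heavily trimmed venue records:

```python
import pandas as pd

# Hypothetical venue records with the same nesting as the API response
venues_demo = [
    {"name": "Demo Diner", "categories": [{"name": "Diner"}],
     "location": {"lat": 43.80, "lng": -79.19, "distance": 1200}},
    {"name": "Demo Cafe", "categories": [],
     "location": {"lat": 43.81, "lng": -79.20, "distance": 800}},
]

# Nested dictionaries become dotted columns; lists are left untouched
flat = pd.json_normalize(venues_demo)
print(sorted(flat.columns))
```

Because lists such as `categories` are not expanded, a separate step is still needed to pull the category name out of each row.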


In [131]:
# keep only columns that include venue name, and anything that is associated with location
filtered_columns = ['name', 'categories'] + [col for col in dataframe.columns if col.startswith('location.')] + ['id']
dataframe_filtered = dataframe.loc[:, filtered_columns].copy() # copy so that later assignments do not touch a view

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# filter the category for each row
dataframe_filtered['categories'] = dataframe_filtered.apply(get_category_type, axis=1)

# clean column names by keeping only last term
dataframe_filtered.columns = [column.split('.')[-1] for column in dataframe_filtered.columns]

dataframe_filtered.head()

Unnamed: 0,name,categories,address,cc,city,country,crossStreet,distance,formattedAddress,labeledLatLngs,lat,lng,neighborhood,postalCode,state,id
0,Ted's Restaurant,Breakfast Spot,404 Old Kingston Rd.,CA,Scarborough,Canada,,3194,"[404 Old Kingston Rd., Scarborough ON, Canada]","[{'label': 'display', 'lat': 43.78446796744621...",43.784468,-79.1692,,,ON,4be6c179d4f7c9b665042720
1,Africa Restaurant,Food & Drink Shop,,CA,Toronto Division,Canada,,1636,"[Toronto Division ON, Canada]","[{'label': 'display', 'lat': 43.81958724358436...",43.819587,-79.184574,,,ON,4c97a82382b56dcbf7afebaa
2,Nirala Sweets & Restaurant,Restaurant,2818 Markham Rd,CA,Scarborough,Canada,,4635,"[2818 Markham Rd, Scarborough ON M1X 1E6, Canada]","[{'label': 'display', 'lat': 43.8229066, 'lng'...",43.822907,-79.247505,,M1X 1E6,ON,4b9eb453f964a5201bfb36e3
3,The Local Cafe Restaurant,Bistro,,CA,Toronto,Canada,,3820,"[Toronto ON M1G, Canada]","[{'label': 'display', 'lat': 43.7844523004027,...",43.784452,-79.230578,,M1G,ON,583f175dfbe8ff549afaa9fd
4,Best Friends Chinese Restaurant 會賓樓,Chinese Restaurant,4810 Sheppard Ave. E,CA,Scarborough,Canada,at Shorting Rd.,4782,"[4810 Sheppard Ave. E (at Shorting Rd.), Scarb...","[{'label': 'display', 'lat': 43.791806, 'lng':...",43.791806,-79.250197,,M1S 4N6,ON,4b5738d3f964a520c62b28e3


Plot my position (green dot), the restaurants whose distance is lower than the mean distance of all restaurants within a radius of 5 km (red dots), and the restaurants whose distance is higher than that mean (blue dots).

In [136]:
venues_map = folium.Map(location=[latitude, longitude], zoom_start=13) # generate map centred on my position
far=0
close=0

# mean distance of all returned restaurants, computed once outside the loop
mean_distance = dataframe_filtered.distance.astype('float').mean()

# add restaurants farther than the mean as blue circle markers and the others as red ones
for distance, lat, lng, label in zip(dataframe_filtered.distance, dataframe_filtered.lat, dataframe_filtered.lng, dataframe_filtered.categories):
    if distance > mean_distance:
        color = 'blue'
        far = far + 1
    else:
        color = 'red'
        close = close + 1

    folium.features.CircleMarker(
        [lat, lng],
        radius=5,
        color=color,
        popup=label,
        fill = True,
        fill_color=color,
        fill_opacity=0.6
    ).add_to(venues_map)

# add my position once, as a larger green circle marker
folium.features.CircleMarker(
    [latitude, longitude],
    radius=10,
    color='green',
    popup="My position",
    fill = True,
    fill_color='green',
    fill_opacity=0.6
).add_to(venues_map)
venues_map

In [137]:
# print the number of restaurants that are close vs far
print(close)
print(far)

21
29
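
The close/far split can also be computed without a loop. The sketch below uses only the five distances visible in the `head()` output above, so its counts differ from the 21 and 29 obtained on the full result set; the variable names are introduced for illustration.

```python
import pandas as pd

# Distances (in metres) of the five restaurants shown in the head() output
distances = pd.Series([3194, 1636, 4635, 3820, 4782], dtype="float")

mean_distance = distances.mean()
is_far = distances > mean_distance  # boolean mask: True where farther than the mean

far_count = int(is_far.sum())
close_count = int((~is_far).sum())
print(mean_distance, close_count, far_count)
```

For these five values the mean is 3613.4 m, which puts two restaurants below it and three above it.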


## 5. Observations
1. Most restaurants are located to the left (west) of my position on the map.
2. 21 restaurants are located at a distance lower than the mean.
3. 29 restaurants are located at a distance higher than the mean.