<h1>Segmenting and Clustering Neighborhoods in Toronto</h1>

### Webscraping and loading of information
First we have to install the required packages for webscraping:

In [4]:
!conda install beautifulsoup4
!conda install lxml
!conda install html5lib
!conda install requests

Solving environment: done

# All requested packages already installed.


Then we load the packages that are going to be used:

In [5]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import re

Next, we download the Wikipedia page as HTML and parse it so that we can extract only the information in the table:

In [6]:
#Webscraping information
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

table = soup.find('table')

We scrape the elements within the table and append them into a Pandas DataFrame:

In [7]:
# Empty List for loading data from website
post_codes = []

for tr in table.find_all('tr')[2:]:
    post_codes.append({
            'PostalCode': tr.find_all('td')[0].text,
            'Borough': tr.find_all('td')[1].text,
            'Neighborhood': re.sub('\n$', '', tr.find_all('td')[2].text)
            })
    
post_codes = pd.DataFrame(post_codes)
post_codes.head(12)

Unnamed: 0,Borough,Neighborhood,PostalCode
0,Not assigned,Not assigned,M2A
1,North York,Parkwoods,M3A
2,North York,Victoria Village,M4A
3,Downtown Toronto,Harbourfront,M5A
4,North York,Lawrence Heights,M6A
5,North York,Lawrence Manor,M6A
6,Downtown Toronto,Queen's Park,M7A
7,Not assigned,Not assigned,M8A
8,Queen's Park,Not assigned,M9A
9,Scarborough,Rouge,M1B
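
The row loop above assumes that every `tr` holds exactly three `td` cells. A minimal sketch of a more defensive variant follows, run on a small made-up HTML snippet so it works offline. The snippet and the helper name `parse_rows` are illustrative assumptions, not part of the notebook, and unlike the `[2:]` slice above it keeps every well-formed data row.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table, so the sketch runs offline.
SAMPLE_HTML = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
  <tr><td>broken row</td></tr>
</table>
"""

def parse_rows(table):
    """Return one dict per table row that has exactly three <td> cells."""
    rows = []
    for tr in table.find_all('tr'):
        cells = tr.find_all('td')      # look the cells up once per row
        if len(cells) != 3:            # skips the header row and malformed rows
            continue
        postal_code, borough, neighborhood = (c.get_text(strip=True) for c in cells)
        rows.append({'PostalCode': postal_code,
                     'Borough': borough,
                     'Neighborhood': neighborhood})
    return rows

# 'html.parser' is the standard-library parser; the notebook itself uses 'lxml'.
sample_table = BeautifulSoup(SAMPLE_HTML, 'html.parser').find('table')
sample_rows = parse_rows(sample_table)
```

Calling `get_text(strip=True)` removes the trailing newline that the notebook strips with `re.sub`.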


Finally, we clean the dataset:
- Drop the rows without an assigned __Borough__

In [8]:
# `~` is the boolean NOT operator for a pandas mask
post_codes = post_codes[~post_codes['Borough'].str.contains("Not assigned")]
post_codes.head(12)

Unnamed: 0,Borough,Neighborhood,PostalCode
1,North York,Parkwoods,M3A
2,North York,Victoria Village,M4A
3,Downtown Toronto,Harbourfront,M5A
4,North York,Lawrence Heights,M6A
5,North York,Lawrence Manor,M6A
6,Downtown Toronto,Queen's Park,M7A
8,Queen's Park,Not assigned,M9A
9,Scarborough,Rouge,M1B
10,Scarborough,Malvern,M1B
12,North York,Don Mills North,M3B


- For those rows without an assigned __Neighborhood__, use the name of the corresponding __Borough__

In [9]:
# Replace the "Not assigned" placeholder with the borough name (no chained assignment)
unassigned = post_codes['Neighborhood'].str.contains("Not assigned")
post_codes['Neighborhood'] = post_codes['Neighborhood'].mask(unassigned, post_codes['Borough'])
post_codes.head(12)

Unnamed: 0,Borough,Neighborhood,PostalCode
1,North York,Parkwoods,M3A
2,North York,Victoria Village,M4A
3,Downtown Toronto,Harbourfront,M5A
4,North York,Lawrence Heights,M6A
5,North York,Lawrence Manor,M6A
6,Downtown Toronto,Queen's Park,M7A
8,Queen's Park,Queen's Park,M9A
9,Scarborough,Rouge,M1B
10,Scarborough,Malvern,M1B
12,North York,Don Mills North,M3B


- Combine all the __Neighborhoods__ that share a __Postal Code__ into a single row

In [10]:
post_codes = post_codes.groupby(["PostalCode", "Borough"], as_index=False)['Neighborhood'].agg(lambda x : ', '.join(set(x)))
post_codes.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Morningside, West Hill, Guildwood"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"Kennedy Park, East Birchmount Park, Ionview"
7,M1L,Scarborough,"Oakridge, Clairlea, Golden Mile"
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West"
9,M1N,Scarborough,"Cliffside West, Birch Cliff"


Lastly, the DataFrame's shape:

In [11]:
#Dimensions of DataFrame
post_codes.shape

(103, 3)

Obtain the latitude and longitude of each __Postal Code__:

In [12]:
geo = 'http://cocl.us/Geospatial_data'

geo_data = pd.read_csv(geo)
geo_data.rename(columns={'Postal Code':'PostalCode'}, 
                 inplace=True)
geo_data.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Join both dataframes on __Postal Code__:

In [13]:
toronto_codes = pd.merge(post_codes, geo_data, on=['PostalCode'])
toronto_codes.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Morningside, West Hill, Guildwood",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
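
`pd.merge` defaults to an inner join, so a postal code missing from the coordinates file would be dropped without notice. A minimal sketch of a stricter join on made-up rows (the code `M9Z` is hypothetical): `how='left'` with `indicator=True` exposes unmatched codes, and `validate='one_to_one'` raises an error if a postal code is duplicated on either side.

```python
import pandas as pd

# 'M9Z' is a made-up code with no coordinates, used to show the check.
codes = pd.DataFrame({'PostalCode': ['M1B', 'M1C', 'M9Z'],
                      'Borough': ['Scarborough', 'Scarborough', 'Example']})
coords = pd.DataFrame({'PostalCode': ['M1B', 'M1C'],
                       'Latitude': [43.806686, 43.784535],
                       'Longitude': [-79.194353, -79.160497]})

merged = pd.merge(codes, coords, on='PostalCode', how='left',
                  validate='one_to_one', indicator=True)

# Rows present only in `codes` have no coordinates.
missing = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
```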


Install the packages needed for constructing the map:

In [16]:
!conda install -c conda-forge geopy --yes 
!conda install -c conda-forge folium=0.5.0 --yes


In [17]:
import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

Get the latitude and longitude of __Toronto__:

In [18]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


#### Create a map of Toronto with neighborhoods superimposed on top.

In [27]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_codes['Latitude'], toronto_codes['Longitude'], toronto_codes['Borough'], toronto_codes['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

__Foursquare Credentials and Version__

In [29]:
CLIENT_ID = 'DRF0SKJCMYWGQ5X0OO1KZJGFQL0BSF4WS53CUU53RJAVMMSD'
CLIENT_SECRET = 'DRGDZ0HCIGEG1CGWG3D1FYDREMR10XO3NF50FXYSOSNLU5BF'
VERSION = '20180605'
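
Hard-coding API credentials in a shared notebook exposes them to every reader. One common alternative is to read them from the environment. The sketch below assumes the variables are named `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`; both names, and the helper `load_credentials`, are chosen for illustration only.

```python
import os

def load_credentials(env=os.environ):
    """Read the Foursquare credentials from environment variables.

    The variable names are an assumption made for this sketch.
    """
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('Set {} before running the notebook'.format(missing)) from None

# Usage with a stand-in mapping; in the notebook, call load_credentials() with no argument.
CLIENT_ID, CLIENT_SECRET = load_credentials({
    'FOURSQUARE_CLIENT_ID': 'your-client-id',
    'FOURSQUARE_CLIENT_SECRET': 'your-client-secret',
})
```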

Create a function that queries Foursquare for the venues near each __Neighborhood__:

In [30]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
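
The list comprehension inside `getNearbyVenues` indexes `categories[0]` directly, which raises `IndexError` for a venue that has no category. Below is a minimal sketch of a more tolerant extraction step, assuming the response has the same nested shape the function above already relies on. The helper name `extract_venues` and the sample payload are hypothetical.

```python
def extract_venues(name, lat, lng, payload):
    """Flatten one explore response into (neighborhood, ..., category) tuples."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to None when a venue has no category.
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Made-up payload mimicking the structure used above.
sample_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Cafe A', 'location': {'lat': 43.65, 'lng': -79.38},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'Unlabelled', 'location': {'lat': 43.66, 'lng': -79.39},
               'categories': []}},
]}]}}
sample_rows = extract_venues('Harbourfront', 43.654, -79.36, sample_payload)
```

An empty or unexpected response yields an empty list instead of an exception, so one failed neighborhood does not abort the whole loop.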

Use the function defined above to collect venue data only for the __Neighborhoods__ whose __Borough__ name contains "Toronto":

In [34]:
LIMIT = 100 # limit of number of venues returned by Foursquare API

radius = 500 # define radius

toronto_codes = toronto_codes[toronto_codes['Borough'].str.contains("Toronto")]

toronto_venues = getNearbyVenues(names=toronto_codes['Neighborhood'],
                                   latitudes=toronto_codes['Latitude'],
                                   longitudes=toronto_codes['Longitude'],
                                   radius=radius  # pass the radius defined above instead of relying on the default
                                  )

The Beaches
Riverdale, The Danforth West
India Bazaar, The Beaches West
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park, Summerhill East
Summerhill West, South Hill, Deer Park, Forest Hill SE, Rathnelly
Rosedale
St. James Town, Cabbagetown
Church and Wellesley
Harbourfront
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
King, Adelaide, Richmond
Toronto Islands, Harbourfront East, Union Station
Toronto Dominion Centre, Design Exchange
Commerce Court, Victoria Hotel
Roselawn
Forest Hill West, Forest Hill North
The Annex, North Midtown, Yorkville
Harbord, University of Toronto
Chinatown, Grange Park, Kensington Market
CN Tower, Bathurst Quay, Railway Lands, South Niagara, King and Spadina, Harbourfront West, Island airport
Stn A PO Boxes 25 The Esplanade
Underground city, First Canadian Place
Christie
Dufferin, Dovercourt Village
Trinity, Little Portugal
Parkdale Village, Brockton, Exhibition Place
The Junction South, High Par

Encode each __Neighborhood__ with its information on venues for clustering: 

In [35]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Group the rows by __Neighborhood__, taking the mean of the frequency of occurrence of each venue category:

In [36]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.0
1,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"CN Tower, Bathurst Quay, Railway Lands, South ...",0.0,0.055556,0.055556,0.055556,0.111111,0.166667,0.111111,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Central Bay Street,0.011628,0.0,0.0,0.0,0.0,0.0,0.0,0.011628,0.0,...,0.0,0.0,0.0,0.011628,0.0,0.0,0.011628,0.0,0.0,0.0
4,"Chinatown, Grange Park, Kensington Market",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.054348,0.0,0.054348,0.01087,0.0,0.0,0.0
5,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Church and Wellesley,0.023256,0.0,0.0,0.0,0.0,0.0,0.0,0.011628,0.0,...,0.0,0.0,0.0,0.0,0.0,0.011628,0.0,0.011628,0.011628,0.0
7,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,...,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.0
8,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
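
Because each venue is a single row with a single 1 in its category column, the per-neighborhood mean is the share of that neighborhood's venues that fall in each category. A minimal sketch on made-up venues (the sample data is an assumption for illustration):

```python
import pandas as pd

# Made-up venues: neighborhood 'A' has four venues, 'B' has one.
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B'],
    'Venue Category': ['Café', 'Café', 'Park', 'Pub', 'Park'],
})

# One indicator column per category, cast to float so the mean is a proportion.
onehot = pd.get_dummies(venues['Venue Category']).astype(float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# Mean of the indicators = relative frequency of each category per neighborhood.
freq = onehot.groupby('Neighborhood').mean()
```

Each row of `freq` sums to 1, so neighborhoods with many venues and neighborhoods with few are compared on the same scale.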


Let's define a function that returns the 10 most common venue categories in each __Neighborhood__:

In [37]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [48]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # ranks beyond the third use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Cocktail Bar,Farmers Market,Seafood Restaurant,Café,Bakery,Steakhouse,Beer Bar,Cheese Shop,Beach
1,Business Reply Mail Processing Centre 969 Eastern,Light Rail Station,Garden Center,Burrito Place,Fast Food Restaurant,Farmers Market,Auto Workshop,Spa,Restaurant,Garden,Smoke Shop
2,"CN Tower, Bathurst Quay, Railway Lands, South ...",Airport Service,Airport Lounge,Airport Terminal,Plane,Boat or Ferry,Airport,Airport Food Court,Airport Gate,Coffee Shop,Harbor / Marina
3,Central Bay Street,Coffee Shop,Sandwich Place,Café,Italian Restaurant,Burger Joint,Ice Cream Shop,Japanese Restaurant,Bakery,Juice Bar,Salad Place
4,"Chinatown, Grange Park, Kensington Market",Café,Vietnamese Restaurant,Vegetarian / Vegan Restaurant,Dumpling Restaurant,Chinese Restaurant,Bar,Coffee Shop,Mexican Restaurant,Bakery,Cocktail Bar
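
The `try`/`except` in the column-naming loop is correct for the ten ranks used here, but it would label ranks such as 21 or 22 as "21th" and "22th" if `num_top_venues` were raised. A minimal sketch of a general helper follows; the name `ordinal` is an assumption, not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:            # 11th, 12th, 13th take 'th'
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

num_top_venues = 10
columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i))
                              for i in range(1, num_top_venues + 1)]
```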


### Clustering of Neighborhoods

Define 4 different clusters of __Neighborhoods__:

In [49]:
# set number of clusters
kclusters = 4

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_codes

# join toronto_codes with neighborhoods_venues_sorted to add the cluster label and top venues to each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
37,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Trail,Health Food Store,Pub,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Women's Store
41,M4K,East Toronto,"Riverdale, The Danforth West",43.679557,-79.352188,3,Greek Restaurant,Coffee Shop,Italian Restaurant,Ice Cream Shop,Furniture / Home Store,Restaurant,Spa,Pub,Indian Restaurant,Diner
42,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572,3,Park,Pizza Place,Brewery,Movie Theater,Fish & Chips Shop,Ice Cream Shop,Sushi Restaurant,Pub,Italian Restaurant,Fast Food Restaurant
43,M4M,East Toronto,Studio District,43.659526,-79.340923,3,Café,Coffee Shop,Brewery,Gastropub,Bakery,Italian Restaurant,American Restaurant,Yoga Studio,Music Store,Sandwich Place
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,0,Lawyer,Park,Bus Line,Swim School,Dim Sum Restaurant,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant
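
The number of clusters above is fixed by hand. One common way to sanity-check such a choice is to compare the within-cluster sum of squares (`inertia_`) across several values of k. The sketch below uses synthetic data rather than the Toronto frequencies; the data and the candidate range are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Three well-separated synthetic blobs standing in for the frequency vectors.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.1 * rng.randn(20, 2) for c in centers])

inertias = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    inertias[k] = model.inertia_   # within-cluster sum of squared distances

# Fit the final model with the k suggested by the bend in the inertia values.
labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X)
```

A pronounced bend in the inertia values suggests a reasonable k. On real venue frequencies the bend is often less clear, so the result is a guide rather than a rule.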


#### Visualize the Clusters

In [51]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],  # index colors directly by the 0-based cluster label
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Thank You!