<h1 align="center"><font size="8">Applied Data Science Capstone</font></h1>
<h3 align="center"><font size="5">By Ian Riera Smolinska</font></h3>

<h2> Segmenting and Clustering Neighborhoods in Toronto. </h2>

<h3> Part 1: Scraping the Wikipedia list of postal codes of Canada and creating the dataframe. </h3>

Although the page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M is the proposed source, its table is updated continuously, which could break the scraping. A fixed revision of the page is used instead: https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=933624196

In [1]:
# install the required libraries for scraping
!pip install beautifulsoup4 # library for web scraping
!pip install lxml # xml parser



In [1]:
# import the required libraries
import pandas as pd # library for data analysis
import requests # library to handle requests
from bs4 import BeautifulSoup # library for scraping

print('Libraries imported.')

Libraries imported.


In [2]:
# get the webpage we want to analyze
source = requests.get('https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=933624196').text

# apply the scraper on the webpage
soup = BeautifulSoup(source, 'lxml')

# extract the table from the webpage
table = soup.find('table', {'class':'wikitable sortable'})

# to get the data from the cells
table_data=""
for tr in table.find_all('tr'):
    row_data=""
    for tds in tr.find_all('td'):
        row_data=row_data+","+tds.text
    table_data=table_data+row_data[1:]
    
print(table_data[0:500])

M1A,Not assigned,Not assigned
M2A,Not assigned,Not assigned
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,Harbourfront
M6A,North York,Lawrence Heights
M6A,North York,Lawrence Manor
M7A,Downtown Toronto,Queen's Park
M8A,Not assigned,Not assigned
M9A,Queen's Park,Not assigned
M1B,Scarborough,Rouge
M1B,Scarborough,Malvern
M2B,Not assigned,Not assigned
M3B,North York,Don Mills North
M4B,East York,Woodbine Gardens
M4B,East York,Parkview Hill
M5B,Downtown Toronto,Ryerso


In [3]:
# define the dataframe columns
column_names = ['PostalCode','Borough', 'Neighborhood'] 

# instantiate the dataframe
df = pd.DataFrame(columns=column_names)
df

Unnamed: 0,PostalCode,Borough,Neighborhood


In [4]:
# get the table into rows
table_rows = table_data.split('\n')

# collect one record per non-empty row, then build the dataframe in a single step
# (DataFrame.append was removed in pandas 2.0)
records = []
for row in table_rows:
    if row != '':
        data = row.split(',')
        records.append({'PostalCode' : data[0], 'Borough' : data[1], 'Neighborhood' : data[2]})

df = pd.DataFrame(records, columns=column_names)

df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
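
The comma-joined string used above works for this table because no cell contains a comma. As a hedged sketch of a more defensive variant, rows can be kept as lists of cells and serialized with the standard `csv` module, which quotes embedded commas. The helper name `rows_to_records` and the second sample row (including the postal code `M9Z`) are made up for illustration and do not come from the Wikipedia table.

```python
import csv
import io

def rows_to_records(rows):
    """Round-trip rows through CSV text and return them as dicts (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerows(rows)  # cells containing commas are quoted automatically
    buffer.seek(0)
    fieldnames = ['PostalCode', 'Borough', 'Neighborhood']
    return list(csv.DictReader(buffer, fieldnames=fieldnames))

# illustrative rows; the second neighborhood contains a comma on purpose
sample_rows = [
    ['M3A', 'North York', 'Parkwoods'],
    ['M9Z', 'Example Borough', 'First Place, Second Place'],
]
records = rows_to_records(sample_rows)
print(records[1]['Neighborhood'])
```

A plain `','.join(...)` followed by `split(',')` would split the second row into four fields instead of three, which is the failure mode this variant avoids.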


<h4> Preprocessing </h4>


In [5]:
# remove the Postal Codes 'Not assigned' to a Borough
df = df[df.Borough != 'Not assigned']
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


In [6]:
# assign the borough name to 'Not assigned' neighborhoods
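df = df.copy() # work on a copy of the filtered frame to avoid chained-assignment issues
df['Neighborhood'] = df['Neighborhood'].where(df['Neighborhood'] != 'Not assigned', df['Borough'])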
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


In [7]:
# unify neighborhoods with same code
df = df.groupby(['PostalCode', 'Borough'], sort=True).agg( ', '.join)

# the index should be restored after deleting and merging rows from the dataframe
df = df.reset_index()
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [8]:
df.shape

(103, 3)
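
As a compact recap of the preprocessing, the three steps can be sketched as one function applied to a tiny frame. The function name `clean_postal_codes` is hypothetical; the sample rows reuse values printed earlier in the scraped output.

```python
import pandas as pd

def clean_postal_codes(raw):
    """Apply the three preprocessing steps to a frame with PostalCode, Borough, Neighborhood."""
    # 1. drop rows whose borough is not assigned
    cleaned = raw[raw['Borough'] != 'Not assigned'].copy()
    # 2. fall back to the borough name when the neighborhood is not assigned
    unassigned = cleaned['Neighborhood'] == 'Not assigned'
    cleaned.loc[unassigned, 'Neighborhood'] = cleaned.loc[unassigned, 'Borough']
    # 3. merge neighborhoods that share a postal code into one comma-separated row
    return (cleaned.groupby(['PostalCode', 'Borough'], sort=True)['Neighborhood']
                   .agg(', '.join)
                   .reset_index())

sample = pd.DataFrame({
    'PostalCode': ['M1A', 'M1B', 'M1B', 'M9A'],
    'Borough': ['Not assigned', 'Scarborough', 'Scarborough', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Rouge', 'Malvern', 'Not assigned'],
})
result = clean_postal_codes(sample)
print(result)
```

On this sample the unassigned `M1A` row is dropped, the two `M1B` rows are merged, and `M9A` takes its borough name as its neighborhood.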

<h3> Part 2: Get the latitude and the longitude coordinates of each neighborhood. </h3>

<h4> Plan A: Tried geocoder, but the calls never returned a result and the kernel locked up. </h4>

!pip install geocoder

import geocoder # import geocoder

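latitudes, longitudes = [], []
max_attempts = 5 # cap the retries so the kernel cannot hang forever

for postal_code in df.PostalCode:
    # initialize your variable to None
    lat_lng_coords = None
    attempts = 0

    # retry a bounded number of times, as geocoder sometimes fails to return a result
    while lat_lng_coords is None and attempts < max_attempts:
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
        attempts += 1

    # store None when no coordinates were returned
    latitudes.append(lat_lng_coords[0] if lat_lng_coords else None)
    longitudes.append(lat_lng_coords[1] if lat_lng_coords else None)

# add the coordinates as new columns instead of appending new rows
df['Latitude'] = latitudes
df['Longitude'] = longitudes
df.head()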

<h4> Plan B: Using the CSV file with the latitudes and longitudes. </h4>

csv file: http://cocl.us/Geospatial_data

In [9]:
df_geo = pd.read_csv('http://cocl.us/Geospatial_data')

In [10]:
df_geo.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [11]:
df['Latitude'] = df_geo['Latitude']
df['Longitude'] = df_geo['Longitude']
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
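
The column assignment above relies on both dataframes listing the postal codes in the same order. As a hedged alternative sketch, the coordinates can be joined on the postal code itself, which does not depend on row order. The two small frames below are illustrative and reuse values from the CSV output.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    'PostalCode': ['M1C', 'M1B'],  # deliberately not in the same order as the coordinates
    'Borough': ['Scarborough', 'Scarborough'],
})
coordinates = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# match rows by postal code, then drop the duplicated key column
merged = (neighborhoods
          .merge(coordinates, left_on='PostalCode', right_on='Postal Code', how='left')
          .drop(columns='Postal Code'))
print(merged)
```

With `how='left'`, a postal code missing from the CSV would keep its row and receive empty coordinates rather than silently shifting the values of other rows.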


<h3> Part 3: Clustering the neighborhoods in Toronto grouped by the postal code. </h3>

In [12]:
# import the libraries for clustering and map representation
import json # library to handle JSON files
import numpy as np # library to handle data in a vectorized manner

from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

print('Libraries imported.')

Libraries imported.


#### Use the geopy library to get the latitude and longitude values of Toronto.

In [13]:
!conda install -c conda-forge geopy --yes # comment out this line if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geographiclib-1.50         |             py_0          34 KB  conda-forge
    geopy-1.21.0               |             py_0          58 KB  conda-forge
    ca-certificates-2019.11.28 |       hecc5488_0         145 KB  conda-forge
    certifi-2019.11.28         |   py36h9f0ad1d_1         149 KB  conda-forge
    python_abi-3.6             |          1_cp36m           4 KB  conda-forge
    openssl-1.1.1f             |       h516909a_0         2.1 MB  conda-forge
    ------------------------------------------------------------
                                           Total:         2.5 MB

The following NEW packages will be INSTALLED:

    geographiclib:   1.50-py_0         conda-forge
    geopy:           1

In [14]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="to_explorer")
to_location = geolocator.geocode(address)
to_latitude = to_location.latitude
to_longitude = to_location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(to_latitude, to_longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


#### Create a map of Toronto, using Folium, with neighborhoods superimposed on top.

In [15]:
!conda install -c conda-forge folium=0.5.0 --yes # comment out this line if folium is already installed
import folium # map rendering library

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    folium-0.5.0               |             py_0          45 KB  conda-forge
    branca-0.4.0               |             py_0          26 KB  conda-forge
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    altair-4.1.0               |             py_1         614 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         713 KB

The following NEW packages will be INSTALLED:

    altair:  4.1.0-py_1 conda-forge
    branca:  0.4.0-py_0 conda-forge
    folium:  0.5.0-py_0 conda-forge
    vincent: 0.4.4-py_1 conda-forge


Downloading and Extracting Packages
folium-0.5.0         | 45 KB     | #####

In [88]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[to_latitude, to_longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

We will focus on the boroughs of central Toronto, that is, those whose names contain the word Toronto.

In [17]:
# create a dataframe with the boroughs whose names contain the word Toronto
df_toronto = df[df['Borough'].str.contains('Toronto')].reset_index(drop=True)
df_toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


In [89]:
# create map of Toronto using latitude and longitude values
map_toronto_center = folium.Map(location=[to_latitude, to_longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Borough'], df_toronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto_center)  
    
map_toronto_center

#### Define Foursquare Credentials and Version

In [19]:
# The code was removed by Watson Studio for sharing.
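
Since the credentials cell was removed for sharing, the names `CLIENT_ID`, `CLIENT_SECRET` and `VERSION` used by the function below are left undefined. One possible way to define them without hard-coding secrets is sketched here. The environment variable names and the default version date are assumptions, not values from the original notebook.

```python
import os

# hypothetical environment variable names; set them in the shell before starting the notebook
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '')
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '')
# assumed API version date in YYYYMMDD form; replace with the one you intend to use
VERSION = os.environ.get('FOURSQUARE_VERSION', '20180605')

if not CLIENT_ID or not CLIENT_SECRET:
    print('Foursquare credentials are not set; API requests will fail.')
else:
    print('Foursquare credentials loaded.')
```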

We define a function to get the venues nearby.

In [20]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['PostalCode', 
                  'PostalCode Latitude', 
                  'PostalCode Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
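
The request URL in the function above is assembled with string formatting. A minimal sketch of the same URL built from a parameter dictionary is given below, using only the standard library. The helper name `build_explore_url` and the placeholder credentials are hypothetical; the endpoint and parameter names are the ones used in the function above.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=50):
    """Return the explore URL for one location (hypothetical helper)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # urlencode escapes special characters, e.g. the comma in the ll value
    return EXPLORE_ENDPOINT + '?' + urlencode(params)

# placeholder credentials, for illustration only
url = build_explore_url('my-id', 'my-secret', '20180605', 43.676357, -79.293031, radius=700)
print(url)
```

Taking `radius` and `limit` as explicit arguments makes it visible at the call site which values are actually sent, instead of depending on a default or a global variable.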

Then we run it for the postal codes we choose.

In [99]:
LIMIT = 50 # limit of number of venues returned by Foursquare
radius = 700 # note: not passed to getNearbyVenues below, so its default radius of 500 m applies

toronto_venues = getNearbyVenues(names=df_toronto['PostalCode'],
                                   latitudes=df_toronto['Latitude'],
                                   longitudes=df_toronto['Longitude']
                                  )

M4E
M4K
M4L
M4M
M4N
M4P
M4R
M4S
M4T
M4V
M4W
M4X
M4Y
M5A
M5B
M5C
M5E
M5G
M5H
M5J
M5K
M5L
M5N
M5P
M5R
M5S
M5T
M5V
M5W
M5X
M6G
M6H
M6J
M6K
M6P
M6R
M6S
M7A
M7Y


In [100]:
print(toronto_venues.shape)
toronto_venues.head()

(1204, 7)


Unnamed: 0,PostalCode,PostalCode Latitude,PostalCode Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M4E,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,M4E,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,M4E,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,M4E,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,M4K,43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant


In [101]:
toronto_venues.groupby('PostalCode').count()

PostalCode,PostalCode Latitude,PostalCode Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
M4E,4,4,4,4,4,4
M4K,42,42,42,42,42,42
M4L,19,19,19,19,19,19
M4M,42,42,42,42,42,42
M4N,4,4,4,4,4,4
M4P,9,9,9,9,9,9
M4R,23,23,23,23,23,23
M4S,36,36,36,36,36,36
M4T,2,2,2,2,2,2
M4V,17,17,17,17,17,17


In [102]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 219 unique categories.


Analyze each postal code.

In [104]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add postal code column back to dataframe
toronto_onehot['PostalCode'] = toronto_venues['PostalCode'] 

# move postal code column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,PostalCode,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Yoga Studio
0,M4E,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,M4E,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M4E,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M4E,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M4K,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [105]:
toronto_onehot.shape

(1204, 220)

In [106]:
toronto_grouped = toronto_onehot.groupby('PostalCode').mean().reset_index()
toronto_grouped

Unnamed: 0,PostalCode,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Yoga Studio
0,M4E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M4K,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,0.0,0.0,...,0.0,0.02381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381
2,M4L,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M4M,0.0,0.0,0.0,0.0,0.0,0.0,0.047619,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,0.0,0.0,0.02381
4,M4N,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,M4P,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,M4R,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478
7,M4S,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,M4T,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,M4V,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0
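
Taking the mean of the one-hot columns per postal code amounts to computing the share of each venue category among that postal code's venues. The sketch below reproduces this for M4E with plain Python, using the four categories listed for M4E in the venue table earlier. It only illustrates the arithmetic and is not the notebook's own code.

```python
from collections import Counter

# the four venue categories returned for M4E in the venue table
m4e_categories = ['Trail', 'Health Food Store', 'Pub', 'Neighborhood']

counts = Counter(m4e_categories)
total = len(m4e_categories)

# frequency = occurrences of a category / number of venues in the postal code
frequencies = {category: count / total for category, count in counts.items()}
print(frequencies)
```

Each category appears once among four venues, which gives the 0.25 shown for `Trail` in the M4E row of the grouped table.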


Let's print each postal code along with the top 5 most common venues.

In [107]:
num_top_venues = 5

for code in toronto_grouped['PostalCode']:
    print("----"+code+"----")
    temp = toronto_grouped[toronto_grouped['PostalCode'] == code].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----M4E----
               venue  freq
0  Health Food Store  0.25
1                Pub  0.25
2              Trail  0.25
3       Neighborhood  0.25
4            Airport  0.00


----M4K----
                    venue  freq
0        Greek Restaurant  0.21
1             Coffee Shop  0.10
2      Italian Restaurant  0.07
3  Furniture / Home Store  0.05
4               Bookstore  0.05


----M4L----
                  venue  freq
0        Sandwich Place  0.11
1  Fast Food Restaurant  0.05
2         Movie Theater  0.05
3          Liquor Store  0.05
4               Brewery  0.05


----M4M----
                 venue  freq
0                 Café  0.10
1          Coffee Shop  0.07
2              Brewery  0.05
3  American Restaurant  0.05
4            Gastropub  0.05


----M4N----
                             venue  freq
0                           Lawyer  0.25
1                             Park  0.25
2                         Bus Line  0.25
3                      Swim School  0.25
4  Molecular Gastro

Let's put that into a pandas dataframe, sorting the venues in descending order.

In [108]:
# function to sort the venues
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [109]:
# Now let's create the new dataframe and display the top 10 venues for each neighborhood.
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['PostalCode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
df_venues_sorted = pd.DataFrame(columns=columns)
df_venues_sorted['PostalCode'] = toronto_grouped['PostalCode']

for ind in np.arange(toronto_grouped.shape[0]):
    df_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

df_venues_sorted.head()

Unnamed: 0,PostalCode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,Neighborhood,Trail,Pub,Health Food Store,Yoga Studio,Deli / Bodega,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant
1,M4K,Greek Restaurant,Coffee Shop,Italian Restaurant,Ice Cream Shop,Furniture / Home Store,Bookstore,Caribbean Restaurant,Pub,Lounge,Spa
2,M4L,Sandwich Place,Park,Food & Drink Shop,Burrito Place,Italian Restaurant,Liquor Store,Restaurant,Ice Cream Shop,Steakhouse,Fast Food Restaurant
3,M4M,Café,Coffee Shop,Gastropub,Brewery,Bakery,American Restaurant,Yoga Studio,Comfort Food Restaurant,Sandwich Place,Cheese Shop
4,M4N,Park,Lawyer,Bus Line,Swim School,Empanada Restaurant,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run
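
The column names above take their suffix from a three-item list and fall back to "th", which is correct for the ten columns used here but would produce "21th" for a longer list. A small sketch of a general ordinal helper follows. The function name `ordinal` is hypothetical.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th and 13th (and the rest of the teens) always take 'th'
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

columns = ['PostalCode'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns[:4])
```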


#### Cluster Neighborhoods
Run k-means to group the postal codes into 4 clusters.

In [110]:
# set number of clusters
kclusters = 4

toronto_grouped_clustering = toronto_grouped.drop('PostalCode', axis=1)

# run k-means clustering
kmeans = KMeans(init="random", n_clusters=kclusters, n_init=12).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 1, 2, 1, 3, 1, 1, 2, 0, 2], dtype=int32)
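
A quick way to see how the postal codes are distributed across clusters is to tally the labels. The sketch below does so for the ten labels printed above. In the notebook itself, one would presumably pass `kmeans.labels_` instead of the copied list.

```python
from collections import Counter

first_labels = [2, 1, 2, 1, 3, 1, 1, 2, 0, 2]  # copied from the output above

# count how many postal codes received each cluster label
cluster_sizes = Counter(first_labels)
for cluster in sorted(cluster_sizes):
    print('Cluster {}: {} postal codes'.format(cluster, cluster_sizes[cluster]))
```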

In [111]:
# Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.
# add clustering labels
df_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df_toronto

# join the cluster labels and sorted venues onto df_toronto, which holds latitude/longitude for each postal code
toronto_merged = toronto_merged.join(df_venues_sorted.set_index('PostalCode'), on='PostalCode')

toronto_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,2,Neighborhood,Trail,Pub,Health Food Store,Yoga Studio,Deli / Bodega,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,1,Greek Restaurant,Coffee Shop,Italian Restaurant,Ice Cream Shop,Furniture / Home Store,Bookstore,Caribbean Restaurant,Pub,Lounge,Spa
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572,2,Sandwich Place,Park,Food & Drink Shop,Burrito Place,Italian Restaurant,Liquor Store,Restaurant,Ice Cream Shop,Steakhouse,Fast Food Restaurant
3,M4M,East Toronto,Studio District,43.659526,-79.340923,1,Café,Coffee Shop,Gastropub,Brewery,Bakery,American Restaurant,Yoga Studio,Comfort Food Restaurant,Sandwich Place,Cheese Shop
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,3,Park,Lawyer,Bus Line,Swim School,Empanada Restaurant,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run


In [121]:
# Finally, let's visualize the resulting clusters
# create map
map_clusters = folium.Map(location=[to_latitude, to_longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
colors_array = cm.rainbow(np.linspace(0, 1, 2*len(x)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['PostalCode'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

- Red = cluster 0.
- Purple = cluster 1.
- Dark Blue = cluster 2.
- Light Blue = cluster 3.
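
The legend above follows from indexing the rainbow palette with `cluster-1`. As a hedged alternative sketch, an explicit mapping from cluster label to color keeps the legend and the markers in sync by construction. The hex values below are hypothetical stand-ins for the four legend colors, not the exact colors produced by `cm.rainbow`.

```python
# hypothetical palette: one fixed hex color per cluster label
CLUSTER_COLORS = {
    0: '#ff0000',  # red
    1: '#8000ff',  # purple
    2: '#0000cd',  # dark blue
    3: '#00bfff',  # light blue
}

def color_for(cluster, default='#808080'):
    """Return the marker color for a cluster label, or grey for an unknown label."""
    return CLUSTER_COLORS.get(cluster, default)

legend = ['{} = cluster {}'.format(color_for(c), c) for c in sorted(CLUSTER_COLORS)]
print(legend)
```

The returned value could then be passed as `color` and `fill_color` when creating each marker.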

Examining the clusters, you can determine the discriminating venue categories that distinguish each cluster. 

In [128]:
# Cluster 0
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[0] + [1] + [2] + [3] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,0,Playground,Gym,Yoga Studio,Deli / Bodega,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run


For cluster 0, we have only one postal code, corresponding to the neighborhoods of Moore Park and Summerhill East in Central Toronto. We will try to determine what makes it different after analyzing the remaining clusters.

In [129]:
# Cluster 1
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[0] + [1] + [2] + [3] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,1,Greek Restaurant,Coffee Shop,Italian Restaurant,Ice Cream Shop,Furniture / Home Store,Bookstore,Caribbean Restaurant,Pub,Lounge,Spa
3,M4M,East Toronto,Studio District,43.659526,1,Café,Coffee Shop,Gastropub,Brewery,Bakery,American Restaurant,Yoga Studio,Comfort Food Restaurant,Sandwich Place,Cheese Shop
5,M4P,Central Toronto,Davisville North,43.712751,1,Park,Asian Restaurant,Sandwich Place,Food & Drink Shop,Department Store,Hotel,Breakfast Spot,Convenience Store,Gym,Cosmetics Shop
6,M4R,Central Toronto,North Toronto West,43.715383,1,Clothing Store,Coffee Shop,Pet Store,Salon / Barbershop,Café,Restaurant,Rental Car Location,Chinese Restaurant,Park,Sporting Goods Shop
11,M4X,Downtown Toronto,"Cabbagetown, St. James Town",43.667967,1,Coffee Shop,Pub,Italian Restaurant,Bakery,Pizza Place,Restaurant,Café,Furniture / Home Store,Indian Restaurant,Sandwich Place
12,M4Y,Downtown Toronto,Church and Wellesley,43.66586,1,Yoga Studio,Gay Bar,Coffee Shop,Men's Store,Restaurant,Gastropub,Ramen Restaurant,Ice Cream Shop,Burrito Place,Burger Joint
13,M5A,Downtown Toronto,Harbourfront,43.65426,1,Coffee Shop,Pub,Park,Theater,Mexican Restaurant,Breakfast Spot,Café,Restaurant,Bakery,Shoe Store
14,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,1,Coffee Shop,Café,Bookstore,Italian Restaurant,Restaurant,Cosmetics Shop,Clothing Store,Theater,Ramen Restaurant,Beer Bar
15,M5C,Downtown Toronto,St. James Town,43.651494,1,Italian Restaurant,Coffee Shop,Café,Park,Farmers Market,Bakery,BBQ Joint,Japanese Restaurant,Restaurant,Thai Restaurant
16,M5E,Downtown Toronto,Berczy Park,43.644771,1,Coffee Shop,Beer Bar,Seafood Restaurant,Restaurant,Farmers Market,Bakery,Café,Cheese Shop,Cocktail Bar,Concert Hall


For cluster 1, we have 28 different postal codes (the 39 selected minus the 11 in the other three clusters), including most of those in Downtown Toronto. At first glance, they all have coffee shops, cafés, bars and restaurants among their top venues. We can say that this is the "Restaurant and coffee time" cluster.

In [134]:
# Cluster 2
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[0] + [1] + [2] + [3] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,2,Neighborhood,Trail,Pub,Health Food Store,Yoga Studio,Deli / Bodega,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,2,Sandwich Place,Park,Food & Drink Shop,Burrito Place,Italian Restaurant,Liquor Store,Restaurant,Ice Cream Shop,Steakhouse,Fast Food Restaurant
7,M4S,Central Toronto,Davisville,43.704324,2,Sandwich Place,Pizza Place,Dessert Shop,Coffee Shop,Sushi Restaurant,Gym,Café,Italian Restaurant,Toy / Game Store,Gourmet Shop
9,M4V,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",43.686412,2,Coffee Shop,Pub,Light Rail Station,Supermarket,Sushi Restaurant,Vietnamese Restaurant,Park,Burger Joint,American Restaurant,Fried Chicken Joint
24,M5R,Central Toronto,"The Annex, North Midtown, Yorkville",43.67271,2,Café,Sandwich Place,Coffee Shop,History Museum,Middle Eastern Restaurant,Burger Joint,Liquor Store,Indian Restaurant,Pub,BBQ Joint
36,M6S,West Toronto,"Runnymede, Swansea",43.651571,2,Pizza Place,Café,Coffee Shop,Italian Restaurant,Sushi Restaurant,Yoga Studio,Gastropub,Restaurant,Pub,Latin American Restaurant


For cluster 2, we have 6 different postal codes. As in the previous cluster, there is a strong presence of restaurants and drink places. However, these neighborhoods have a distinctive kind of venue: pizza and sandwich places. This is more takeaway food than sit-down dining. We will call it the "Grab, eat and go" cluster.

In [133]:
# Cluster 3
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[0] + [1] + [2] + [3] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,M4N,Central Toronto,Lawrence Park,43.72802,3,Park,Lawyer,Bus Line,Swim School,Empanada Restaurant,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run
10,M4W,Downtown Toronto,Rosedale,43.679563,3,Park,Playground,Trail,Dance Studio,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run,Distribution Center
22,M5N,Central Toronto,Roselawn,43.711695,3,Pool,Health & Beauty Service,Garden,Yoga Studio,Deli / Bodega,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run
23,M5P,Central Toronto,"Forest Hill North, Forest Hill West",43.696948,3,Jewelry Store,Trail,Park,Bus Line,Sushi Restaurant,Home Service,Cosmetics Shop,Costume Shop,Comfort Food Restaurant,Eastern European Restaurant


For cluster 3, we have 4 different postal codes. Looking at the top 3 venues, we can see that they share parks, trails and gardens. We can say that this is the "Play outside" cluster.

So cluster 1 is focused on coffee and restaurants, cluster 2 on sandwich and pizza places, and cluster 3 on parks and the outdoors. Cluster 0, by contrast, has no coffee, pizza or sandwich places, and not many restaurants or parks. Its top 3 venues are a playground, a gym and a yoga studio, so we can call it the "Keep fit" cluster.

Therefore, if you visit Toronto you should check:
- Cluster 0 neighborhoods if you want to practice some sport.
- Cluster 1 if you are looking for a date in a nice coffee shop or restaurant.
- Cluster 2 if you are just looking for a sandwich or a slice of pizza to settle your stomach and keep going.
- Cluster 3 if all you need is a walk through the park.

### Thank you for checking my project.

This notebook was created by [Ian Riera Smolinska](https://www.linkedin.com/in/ianrierasmolinska/) for the completion of the [Applied Data Science Capstone](https://www.coursera.org/learn/applied-data-science-capstone) course from Coursera.