<h1 align=center><font size = 6><b>Segmenting and Clustering Neighborhoods in Toronto</b></font></h1>

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Question 1:</a>
    
    1.1 <a href="#item1.1">Download & Explore Dataset (Notebook created)</a>
    
    1.2 <a href="#item1.2">Web page scraped</a>
    
    1.3 <a href="#item1.3">Data transformed into pandas dataframe</a>
    
    1.4 <a href="#item1.4">DataFrame cleaned and notebook annotate</a>
    

2. <a href="#item2">Explore Neighborhoods in Toronto</a>
    

3. <a href="#item3">Analyze Each Neighborhood</a>
    

4. <a href="#item4">Cluster Neighborhoods</a>
    

5. <a href="#item5">Examine Clusters</a>    
</font>
</div>

### *1.1 Download & Explore Dataset (Notebook created)*

In [2]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis

import json # library to handle JSON files

import requests # library to handle requests

import ssl

import csv

from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

import matplotlib.cm as cm
import matplotlib.colors as colors # Matplotlib and associated plotting modules

from sklearn.cluster import KMeans # import k-means from clustering stage

!conda install -c conda-forge beautifulsoup4 --yes
from bs4 import BeautifulSoup # website scraping libraries and packages in Python from BeautifulSoup 

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim  # convert an address into latitude and longitude values

print("Libraries imported, csv imported, and SSL certificate errors ignored.")

Solving environment: done

# All requested packages already installed.

Solving environment: done

# All requested packages already installed.

Libraries imported, csv imported, and SSL certificate errors ignored.


#### *1.2. Web page scraping*

This Wikipage has a list of postal codes in Canada where the first letter is **M**. Postal codes beginning with **M** are located within the city of Toronto in the province of Ontario.
Below here is the link to the dataset: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [3]:
# GET request
data = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(data, 'html.parser')

In [4]:
# 3 list in Array
postalCodeList = []
boroughList = []
neighborhoodList = []

# Locating the table and postal code
soup.find('table').find_all('tr')

# find all the rows of the table
soup.find('table').find_all('tr')

# for each row of the table, find all the table data
for row in soup.find('table').find_all('tr'):
    cells = row.find_all('td')

In [5]:
for row in soup.find('table').find_all('tr'):
    cells = row.find_all('td')
    if(len(cells) > 0):
        postalCodeList.append(cells[0].text)
        boroughList.append(cells[1].text)
        neighborhoodList.append(cells[2].text.rstrip('\n'))

### *1.3 Data transformed into pandas DataFrame*

In [6]:
toronto_df = pd.DataFrame({"PostalCode": postalCodeList,
                           "Borough": boroughList,
                           "Neighborhood": neighborhoodList})

toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [7]:
toronto_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 287 entries, 0 to 286
Data columns (total 3 columns):
PostalCode      287 non-null object
Borough         287 non-null object
Neighborhood    287 non-null object
dtypes: object(3)
memory usage: 6.8+ KB


In [8]:
toronto_df.shape

(287, 3)

### *1.4 DataFrame cleaned and notebook annotate*

Processing the cells that have an assigned borough, and ignoring cells with "**Not assigned**" boroughs, like in rows 1 and 2 in the above table.

In [9]:
toronto_df_drop = toronto_df[toronto_df.Borough != "Not assigned"].reset_index(drop=True)
toronto_df_drop.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor


In [10]:
# Grouping neighbors in same borough

toronto_df_grouped = toronto_df_drop.groupby(["PostalCode", "Borough"], as_index=False).agg(lambda x: ", ".join(x))
toronto_df_grouped.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [11]:
# Neighborhood WHICH IS "Not assigned", make the value the same as Borough

for index, row in toronto_df_grouped.iterrows():
    if row["Neighborhood"] == "Not assigned":
        row["Neighborhood"] = row["Borough"]
        
toronto_df_grouped.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [12]:
toronto_df.shape

(287, 3)

In [13]:
column_names = ["PostalCode", "Borough", "Neighborhood"]
test_df = pd.DataFrame(columns=column_names)

test_list = ["M5G", "M2H", "M4B", "M1J", "M4G", "M4M", "M1R", "M9V", "M9L", "M5V", "M1B", "M5A"]

for postcode in test_list:
    test_df = test_df.append(toronto_df_grouped[toronto_df_grouped["PostalCode"]==postcode], ignore_index=True)
    
test_df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M5G,Downtown Toronto,Central Bay Street
1,M2H,North York,Hillcrest Village
2,M4B,East York,"Woodbine Gardens, Parkview Hill"
3,M1J,Scarborough,Scarborough Village
4,M4G,East York,Leaside
5,M4M,East Toronto,Studio District
6,M1R,Scarborough,"Maryvale, Wexford"
7,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."
8,M9L,North York,Humber Summit
9,M5V,Downtown Toronto,"CN Tower, Bathurst Quay, Island airport, Harbo..."


In [14]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 3 columns):
PostalCode      12 non-null object
Borough         12 non-null object
Neighborhood    12 non-null object
dtypes: object(3)
memory usage: 368.0+ bytes


In [15]:
toronto_df_new = toronto_df_grouped
toronto_df_new.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [16]:
toronto_df_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 3 columns):
PostalCode      103 non-null object
Borough         103 non-null object
Neighborhood    103 non-null object
dtypes: object(3)
memory usage: 2.5+ KB


In [17]:
toronto_df_new.shape

(103, 3)

### *End of Question 1:* 

## **Question 2:**

Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.

Since Google Maps Geocoding API is now charging for their API; I won't be using it, I rather be using the Geocoder Python package instead:
https://geocoder.readthedocs.io/index.html.


The problem with this Package is one have to be persistent sometimes in order to get the geographical coordinates of a given postal code. So you can make a call to get the latitude and longitude coordinates of a given postal code and the result would be None, and then make the call again and you would get the coordinates. So, in order to make sure that you get the coordinates for all of our neighborhoods, you can run a while loop for each postal code.

Given that this package can be very unreliable, in case one is not able to get the geographical coordinates of the neighborhoods using the Geocoder package.

I will then be using this link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

### *2.1. Reading csv file into Panda*

In [18]:
coordinates = pd.read_csv('https://cocl.us/Geospatial_data')
coordinates.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [19]:
coordinates.rename(columns={"Postal Code": "PostalCode"}, inplace=True)
coordinates.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [20]:
# Merging the data
toronto_df_new = toronto_df_grouped.merge(coordinates, on="PostalCode", how="left")
toronto_df_new.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


### *2.2. Used the file to create the requested dataframe*

In [21]:
# Testing to see coordinates are added

column_names = ["PostalCode", "Borough", "Neighborhood", "Latitude", "Longitude"]
test_df = pd.DataFrame(columns=column_names)

test_list = ["M5G", "M2H", "M4B", "M1J", "M4G", "M4M", "M1R", "M9V", "M9L", "M5V", "M1B", "M5A"]

for postcode in test_list:
    test_df = test_df.append(toronto_df_new[toronto_df_new["PostalCode"]==postcode], ignore_index=True)
    
test_df

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
1,M2H,North York,Hillcrest Village,43.803762,-79.363452
2,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937
3,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
4,M4G,East York,Leaside,43.70906,-79.363452
5,M4M,East Toronto,Studio District,43.659526,-79.340923
6,M1R,Scarborough,"Maryvale, Wexford",43.750072,-79.295849
7,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ...",43.739416,-79.588437
8,M9L,North York,Humber Summit,43.756303,-79.565963
9,M5V,Downtown Toronto,"CN Tower, Bathurst Quay, Island airport, Harbo...",43.628947,-79.39442


In [22]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 5 columns):
PostalCode      12 non-null object
Borough         12 non-null object
Neighborhood    12 non-null object
Latitude        12 non-null float64
Longitude       12 non-null float64
dtypes: float64(2), object(3)
memory usage: 560.0+ bytes


In [23]:
# Longitude and Latitude

address = 'Toronto'

geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto are {}, {}.'.format(latitude, longitude))


The geograpical coordinate of Toronto are 43.653963, -79.387207.


### **End of Question 2:**

## **Question 3:**

Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.

Just make sure:

1. To add enough Markdown cells to explain what you decided to do and to report any observations you make.

2. To generate maps to visualize your neighborhoods and how they cluster together.

In [24]:
# Installing map rendering library

!conda install -c conda-forge folium=0.5.0 --yes
import folium

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    altair-4.0.1               |             py_0         575 KB  conda-forge
    branca-0.4.0               |             py_0          26 KB  conda-forge
    folium-0.5.0               |             py_0          45 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         673 KB

The following NEW packages will be INSTALLED:

    altair:  4.0.1-py_0 conda-forge
    branca:  0.4.0-py_0 conda-forge
    folium:  0.5.0-py_0 conda-forge
    vincent: 0.4.4-py_1 conda-forge


Downloading and Extracting Packages
vincent-0.4.4        | 28 KB     | #####

### *3.1. Exploring the Neighborhoods in Toronto*

In [25]:
borough_names = list(toronto_df_new.Borough.unique())

borough_with_toronto = []

for x in borough_names:
    if "toronto" in x.lower():
        borough_with_toronto.append(x)
        
borough_with_toronto

['East Toronto', 'Central Toronto', 'Downtown Toronto', 'West Toronto']

In [26]:
# create a new DataFrame with only boroughs that contain the word Toronto

toronto_df_new = toronto_df_new[toronto_df_new['Borough'].isin(borough_with_toronto)].reset_index(drop=True)
print(toronto_df_new.shape)
toronto_df_new.head()

(39, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


In [27]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_df_new['Latitude'], toronto_df_new['Longitude'], toronto_df_new['Borough'], toronto_df_new['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)  
    
map_toronto

### *3.2. Inputing Foursquare Credentials and Version*

In [28]:
CLIENT_ID = 'EYEKZFP1CNL513TIIUUKPHC5453ERKKJUHXVQJOQFYXH0GTW' # your Foursquare ID
CLIENT_SECRET = 'SPT0KBUTWB5LLURSTZJO0VOUMKHJ3GFC3DPNA3JDAZQVBTK2'  # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: EYEKZFP1CNL513TIIUUKPHC5453ERKKJUHXVQJOQFYXH0GTW
CLIENT_SECRET:SPT0KBUTWB5LLURSTZJO0VOUMKHJ3GFC3DPNA3JDAZQVBTK2


In [29]:
radius = 500
LIMIT = 100

venues = []

for lat, long, post, borough, neighborhood in zip(toronto_df_new['Latitude'], toronto_df_new['Longitude'], toronto_df_new['PostalCode'], toronto_df_new['Borough'], 
                                                  toronto_df_new['Neighborhood']):
    url = "https://api.foursquare.com/v2/venues/explore?client_id=EYEKZFP1CNL513TIIUUKPHC5453ERKKJUHXVQJOQFYXH0GTW&client_secret=SPT0KBUTWB5LLURSTZJO0VOUMKHJ3GFC3DPNA3JDAZQVBTK2&v=20180605 \
     &ll=43.653963,-79.387207&radius=500&limit=100".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    for venue in results:
        venues.append((
            post, 
            borough,
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))

### *Converting venues list to new DataFrame*

In [30]:
venues_df = pd.DataFrame(venues)


venues_df.columns = ['PostalCode', 'Borough', 'Neighborhood', 'BoroughLatitude', 'BoroughLongitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df.head()

(2808, 9)


Unnamed: 0,PostalCode,Borough,Neighborhood,BoroughLatitude,BoroughLongitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,Downtown Toronto,43.653232,-79.385296,Neighborhood
1,M4E,East Toronto,The Beaches,43.676357,-79.293031,Japango,43.655268,-79.385165,Sushi Restaurant
2,M4E,East Toronto,The Beaches,43.676357,-79.293031,Poke Guys,43.654895,-79.385052,Poke Place
3,M4E,East Toronto,The Beaches,43.676357,-79.293031,Rolltation,43.654918,-79.387424,Japanese Restaurant
4,M4E,East Toronto,The Beaches,43.676357,-79.293031,Sansotei Ramen 三草亭,43.655157,-79.386501,Ramen Restaurant


In [31]:
# Checking how many VENUES are returned

venues_df.groupby(["PostalCode", "Borough", "Neighborhood"]).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,BoroughLatitude,BoroughLongitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
PostalCode,Borough,Neighborhood,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
M4E,East Toronto,The Beaches,72,72,72,72,72,72
M4K,East Toronto,"The Danforth West, Riverdale",72,72,72,72,72,72
M4L,East Toronto,"The Beaches West, India Bazaar",72,72,72,72,72,72
M4M,East Toronto,Studio District,72,72,72,72,72,72
M4N,Central Toronto,Lawrence Park,72,72,72,72,72,72
M4P,Central Toronto,Davisville North,72,72,72,72,72,72
M4R,Central Toronto,North Toronto West,72,72,72,72,72,72
M4S,Central Toronto,Davisville,72,72,72,72,72,72
M4T,Central Toronto,"Moore Park, Summerhill East",72,72,72,72,72,72
M4V,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West",72,72,72,72,72,72


In [32]:
# Analysing each area

# one hot encoding
toronto_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")

# add postal, borough and neighborhood column back to dataframe
toronto_onehot['PostalCode'] = venues_df['PostalCode'] 
toronto_onehot['Borough'] = venues_df['Borough'] 
toronto_onehot['Neighborhoods'] = venues_df['Neighborhood'] 

# move postal, borough and neighborhood column to the first column
fixed_columns = list(toronto_onehot.columns[-3:]) + list(toronto_onehot.columns[:-3])
toronto_onehot = toronto_onehot[fixed_columns]

print(toronto_onehot.shape)
toronto_onehot.head()

(2808, 53)


Unnamed: 0,PostalCode,Borough,Neighborhoods,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Bar,Breakfast Spot,Bubble Tea Shop,...,Seafood Restaurant,Smoke Shop,Speakeasy,Sushi Restaurant,Tapas Restaurant,Tea Room,Thai Restaurant,Toy / Game Store,University,Vegetarian / Vegan Restaurant
0,M4E,East Toronto,The Beaches,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M4E,East Toronto,The Beaches,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2,M4E,East Toronto,The Beaches,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M4E,East Toronto,The Beaches,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M4E,East Toronto,The Beaches,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [33]:
# Grouping rows by neighborhood and by taking the MEAN of the frequency of occurence of each category

toronto_grouped = toronto_onehot.groupby(["PostalCode", "Borough", "Neighborhoods"]).mean().reset_index()

print(toronto_grouped.shape)
toronto_grouped


(39, 53)


Unnamed: 0,PostalCode,Borough,Neighborhoods,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Bar,Breakfast Spot,Bubble Tea Shop,...,Seafood Restaurant,Smoke Shop,Speakeasy,Sushi Restaurant,Tapas Restaurant,Tea Room,Thai Restaurant,Toy / Game Store,University,Vegetarian / Vegan Restaurant
0,M4E,East Toronto,The Beaches,0.013889,0.041667,0.013889,0.013889,0.027778,0.041667,0.013889,...,0.013889,0.013889,0.013889,0.055556,0.013889,0.013889,0.013889,0.013889,0.013889,0.027778
1,M4K,East Toronto,"The Danforth West, Riverdale",0.013889,0.041667,0.013889,0.013889,0.027778,0.041667,0.013889,...,0.013889,0.013889,0.013889,0.055556,0.013889,0.013889,0.013889,0.013889,0.013889,0.027778
2,M4L,East Toronto,"The Beaches West, India Bazaar",0.013889,0.041667,0.013889,0.013889,0.027778,0.041667,0.013889,...,0.013889,0.013889,0.013889,0.055556,0.013889,0.013889,0.013889,0.013889,0.013889,0.027778
3,M4M,East Toronto,Studio District,0.013889,0.041667,0.013889,0.013889,0.027778,0.041667,0.013889,...,0.013889,0.013889,0.013889,0.055556,0.013889,0.013889,0.013889,0.013889,0.013889,0.027778
4,M4N,Central Toronto,Lawrence Park,0.013889,0.041667,0.013889,0.013889,0.027778,0.041667,0.013889,...,0.013889,0.013889,0.013889,0.055556,0.013889,0.013889,0.013889,0.013889,0.013889,0.027778
5,M4P,Central Toronto,Davisville North,0.013889,0.041667,0.013889,0.013889,0.027778,0.041667,0.013889,...,0.013889,0.013889,0.013889,0.055556,0.013889,0.013889,0.013889,0.013889,0.013889,0.027778
6,M4R,Central Toronto,North Toronto West,0.013889,0.041667,0.013889,0.013889,0.027778,0.041667,0.013889,...,0.013889,0.013889,0.013889,0.055556,0.013889,0.013889,0.013889,0.013889,0.013889,0.027778
7,M4S,Central Toronto,Davisville,0.013889,0.041667,0.013889,0.013889,0.027778,0.041667,0.013889,...,0.013889,0.013889,0.013889,0.055556,0.013889,0.013889,0.013889,0.013889,0.013889,0.027778
8,M4T,Central Toronto,"Moore Park, Summerhill East",0.013889,0.041667,0.013889,0.013889,0.027778,0.041667,0.013889,...,0.013889,0.013889,0.013889,0.055556,0.013889,0.013889,0.013889,0.013889,0.013889,0.027778
9,M4V,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",0.013889,0.041667,0.013889,0.013889,0.027778,0.041667,0.013889,...,0.013889,0.013889,0.013889,0.055556,0.013889,0.013889,0.013889,0.013889,0.013889,0.027778


### *3.3. New DataFrame and displaying top 10 venues of PostalCode*

In [34]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
areaColumns = ['PostalCode', 'Borough', 'Neighborhoods']
freqColumns = []
for ind in np.arange(num_top_venues):
    try:
        freqColumns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        freqColumns.append('{}th Most Common Venue'.format(ind+1))
columns = areaColumns+freqColumns

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['PostalCode'] = toronto_grouped['PostalCode']
neighborhoods_venues_sorted['Borough'] = toronto_grouped['Borough']
neighborhoods_venues_sorted['Neighborhoods'] = toronto_grouped['Neighborhoods']

for ind in np.arange(toronto_grouped.shape[0]):
    row_categories = toronto_grouped.iloc[ind, :].iloc[3:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    neighborhoods_venues_sorted.iloc[ind, 3:] = row_categories_sorted.index.values[0:num_top_venues]

# neighborhoods_venues_sorted.sort_values(freqColumns, inplace=True)
print(neighborhoods_venues_sorted.shape)
neighborhoods_venues_sorted


(39, 13)


Unnamed: 0,PostalCode,Borough,Neighborhoods,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
1,M4K,East Toronto,"The Danforth West, Riverdale",Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
2,M4L,East Toronto,"The Beaches West, India Bazaar",Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
3,M4M,East Toronto,Studio District,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
4,M4N,Central Toronto,Lawrence Park,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
5,M4P,Central Toronto,Davisville North,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
6,M4R,Central Toronto,North Toronto West,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
7,M4S,Central Toronto,Davisville,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
8,M4T,Central Toronto,"Moore Park, Summerhill East",Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
9,M4V,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant


### *3.4. Clustering*

In [35]:
#Clustring

kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(["PostalCode", "Borough", "Neighborhoods"], 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

  return_n_iter=True)


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

In [36]:
#create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

toronto_merged = toronto_df_new.copy()

# add clustering labels
toronto_merged["Cluster Labels"] = kmeans.labels_

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.drop(["Borough", "Neighborhoods"], 1).set_index("PostalCode"), on="PostalCode")

print(toronto_merged.shape)
toronto_merged.head() # check the last columns!


(39, 16)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572,0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
3,M4M,East Toronto,Studio District,43.659526,-79.340923,0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant


In [37]:
# sort the results by Cluster Labels

print(toronto_merged.shape)
toronto_merged.sort_values(["Cluster Labels"], inplace=True)
toronto_merged

(39, 16)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
21,M5L,Downtown Toronto,"Commerce Court, Victoria Hotel",43.648198,-79.379817,0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
22,M5N,Central Toronto,Roselawn,43.711695,-79.416936,0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
23,M5P,Central Toronto,"Forest Hill North, Forest Hill West",43.696948,-79.411307,0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
24,M5R,Central Toronto,"The Annex, North Midtown, Yorkville",43.67271,-79.405678,0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
25,M5S,Downtown Toronto,"Harbord, University of Toronto",43.662696,-79.400049,0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
26,M5T,Downtown Toronto,"Chinatown, Grange Park, Kensington Market",43.653206,-79.400049,0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
27,M5V,Downtown Toronto,"CN Tower, Bathurst Quay, Island airport, Harbo...",43.628947,-79.39442,0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
20,M5K,Downtown Toronto,"Design Exchange, Toronto Dominion Centre",43.647177,-79.381576,0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
28,M5W,Downtown Toronto,Stn A PO Boxes 25 The Esplanade,43.646435,-79.374846,0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant


### *3.5. Visualizing Clusters*

In [38]:
# Visualizing the Clusters

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, post, bor, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['PostalCode'], toronto_merged['Borough'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup('{} ({}): {} - Cluster {}'.format(bor, post, poi, cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### *3.6 Examine Clusters*

**Cluster 1**

In [39]:
# Cluster 1

toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + \
                                                                                 list(range(5, toronto_merged.shape[1]))]]


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,East Toronto,0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
21,Downtown Toronto,0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
22,Central Toronto,0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
23,Central Toronto,0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
24,Central Toronto,0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
25,Downtown Toronto,0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
26,Downtown Toronto,0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
27,Downtown Toronto,0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
20,Downtown Toronto,0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant
28,Downtown Toronto,0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Café,Art Gallery,Breakfast Spot,Vegetarian / Vegan Restaurant,Hotel,Park,Ramen Restaurant


**Cluster 2**

In [40]:
# Cluster 2

toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + \
                                                                                 list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue


**Cluster 3**

In [41]:
# Cluster 3

toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + \
                                                                                 list(range(5, toronto_merged.shape[1]))]]


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue


**Cluster 4**

In [42]:
# CLuster 4

toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + \
                                                                                 list(range(5, toronto_merged.shape[1]))]]


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue


**Cluster 5**

In [43]:
# Cluster 5

toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + \
                                                                                 list(range(5, toronto_merged.shape[1]))]]


Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue


### **Conclusion**

From examination, most of the neighborhoods fall into **Cluster 1** which are the areas with cafe, restaurants, supermarkets etc.