## _<center> Webscraping (Part-1)</center>_

### Importing necessary packages

In [3]:
!pip install folium

Collecting folium
  Downloading https://files.pythonhosted.org/packages/a4/f0/44e69d50519880287cc41e7c8a6acc58daa9a9acf5f6afc52bcc70f69a6d/folium-0.11.0-py2.py3-none-any.whl (93kB)
Collecting branca>=0.3.0 (from folium)
  Downloading https://files.pythonhosted.org/packages/13/fb/9eacc24ba3216510c6b59a4ea1cd53d87f25ba76237d7f4393abeaf4c94e/branca-0.4.1-py3-none-any.whl
Installing collected packages: branca, folium
Successfully installed branca-0.4.1 folium-0.11.0


In [4]:
import pandas as pd
import folium
import numpy as np
import warnings
from geopy.geocoders import Nominatim
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
warnings.filterwarnings('ignore')

Importing necessary packages for webscraping

In [5]:
from bs4 import BeautifulSoup as bs
import requests
import re

### Obtaining dataset from Wikipedia

In [6]:
#wikipedia url
wiki_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [7]:
#getting website data
wiki_response = requests.get(wiki_url).content
htmlsoup = bs(wiki_response, 'html.parser')

#### Getting web content - Toronto table 

In [8]:
toronto_table = htmlsoup.table.text
#see the table
toronto_table[:100]

'\n\nPostal Code\n\nBorough\n\nNeighborhood\n\n\nM1A\n\nNot assigned\n\nNot assigned\n\n\nM2A\n\nNot assigned\n\nNot assi'

####  Making the Toronto Postal Code dataframe

In [9]:
#removing escape sequence
table_values = re.sub('\n',' ', toronto_table)
table_values[:100]

'  Postal Code  Borough  Neighborhood   M1A  Not assigned  Not assigned   M2A  Not assigned  Not assi'

In [10]:
table_list = table_values.split('   ')
table_list[:300]

['  Postal Code  Borough  Neighborhood',
 'M1A  Not assigned  Not assigned',
 'M2A  Not assigned  Not assigned',
 'M3A  North York  Parkwoods',
 'M4A  North York  Victoria Village',
 'M5A  Downtown Toronto  Regent Park, Harbourfront',
 'M6A  North York  Lawrence Manor, Lawrence Heights',
 "M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government",
 'M8A  Not assigned  Not assigned',
 'M9A  Etobicoke  Islington Avenue, Humber Valley Village',
 'M1B  Scarborough  Malvern, Rouge',
 'M2B  Not assigned  Not assigned',
 'M3B  North York  Don Mills',
 'M4B  East York  Parkview Hill, Woodbine Gardens',
 'M5B  Downtown Toronto  Garden District, Ryerson',
 'M6B  North York  Glencairn',
 'M7B  Not assigned  Not assigned',
 'M8B  Not assigned  Not assigned',
 'M9B  Etobicoke  West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale',
 'M1C  Scarborough  Rouge Hill, Port Union, Highland Creek',
 'M2C  Not assigned  Not assigned',
 'M3C  North York  Don Mills',
 'M4C  East Yo

We do not need the header row (the column names), nor the leading or trailing whitespace in each entry, as in `' M2A Not assigned'`.

In [11]:
clean_table_list = table_list[1:]
clean_table_list = [entry.strip() for entry in clean_table_list]
clean_table_list[:20]

['M1A  Not assigned  Not assigned',
 'M2A  Not assigned  Not assigned',
 'M3A  North York  Parkwoods',
 'M4A  North York  Victoria Village',
 'M5A  Downtown Toronto  Regent Park, Harbourfront',
 'M6A  North York  Lawrence Manor, Lawrence Heights',
 "M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government",
 'M8A  Not assigned  Not assigned',
 'M9A  Etobicoke  Islington Avenue, Humber Valley Village',
 'M1B  Scarborough  Malvern, Rouge',
 'M2B  Not assigned  Not assigned',
 'M3B  North York  Don Mills',
 'M4B  East York  Parkview Hill, Woodbine Gardens',
 'M5B  Downtown Toronto  Garden District, Ryerson',
 'M6B  North York  Glencairn',
 'M7B  Not assigned  Not assigned',
 'M8B  Not assigned  Not assigned',
 'M9B  Etobicoke  West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale',
 'M1C  Scarborough  Rouge Hill, Port Union, Highland Creek',
 'M2C  Not assigned  Not assigned']

Next, each entry is split into its three fields so that each field can be stored in its own column of the dataframe.

In [12]:
clean_table_list_v2 = [entry.split('  ') for entry in clean_table_list]
clean_table_list_v2

[['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Regent Park, Harbourfront'],
 ['M6A', 'North York', 'Lawrence Manor, Lawrence Heights'],
 ['M7A', 'Downtown Toronto', "Queen's Park, Ontario Provincial Government"],
 ['M8A', 'Not assigned', 'Not assigned'],
 ['M9A', 'Etobicoke', 'Islington Avenue, Humber Valley Village'],
 ['M1B', 'Scarborough', 'Malvern, Rouge'],
 ['M2B', 'Not assigned', 'Not assigned'],
 ['M3B', 'North York', 'Don Mills'],
 ['M4B', 'East York', 'Parkview Hill, Woodbine Gardens'],
 ['M5B', 'Downtown Toronto', 'Garden District, Ryerson'],
 ['M6B', 'North York', 'Glencairn'],
 ['M7B', 'Not assigned', 'Not assigned'],
 ['M8B', 'Not assigned', 'Not assigned'],
 ['M9B',
  'Etobicoke',
  'West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale'],
 ['M1C', 'Scarborough', 'Rouge Hill, Port Union, Highland Creek'],
 ['M
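Splitting on runs of spaces works for this page, but it relies on no cell containing consecutive spaces. As a hedged alternative, the sketch below reads a table row by row with BeautifulSoup's `find_all`. The `sample_html` string and the `parse_rows` helper are illustrative assumptions and are not part of the original notebook; the live page may need extra handling.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, used only for illustration
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Regent Park, Harbourfront</td></tr>
</table>
"""

def parse_rows(html):
    """Return one list of cell strings per data row of the first table."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tr in soup.table.find_all('tr'):
        # header rows use <th>, so they yield no <td> cells and are skipped
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if cells:
            rows.append(cells)
    return rows

sample_rows = parse_rows(sample_html)
print(sample_rows)
```

Applied to the notebook's `htmlsoup`, the same loop would run over `htmlsoup.table.find_all('tr')`.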

Each entry in `clean_table_list_v2` is now a list of three fields, ready to be unpacked into columns.

In [13]:
postal_codes = [code[0] for code in clean_table_list_v2[:-1]]
postal_codes[:10]

['M1A', 'M2A', 'M3A', 'M4A', 'M5A', 'M6A', 'M7A', 'M8A', 'M9A', 'M1B']

In [14]:
borough = [bor[1] for bor in clean_table_list_v2[:-1]]
borough[:10]

['Not assigned',
 'Not assigned',
 'North York',
 'North York',
 'Downtown Toronto',
 'North York',
 'Downtown Toronto',
 'Not assigned',
 'Etobicoke',
 'Scarborough']

In [15]:
neighborhood = ['na' if neigh[1] == 'Not assigned' else neigh[2] for neigh in clean_table_list_v2[:-1]]
neighborhood[:10]

['na',
 'na',
 'Parkwoods',
 'Victoria Village',
 'Regent Park, Harbourfront',
 'Lawrence Manor, Lawrence Heights',
 "Queen's Park, Ontario Provincial Government",
 'na',
 'Islington Avenue, Humber Valley Village',
 'Malvern, Rouge']

In [16]:
toronto_df = pd.DataFrame(columns = ['PostalCode', 'Borough', 'Neighborhood'])
toronto_df['PostalCode'] = postal_codes
toronto_df['Borough'] = borough
toronto_df['Neighborhood'] = neighborhood
toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,na
1,M2A,Not assigned,na
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


#### Ignoring cells with Borough = 'Not assigned' 

In [17]:
toronto_df = toronto_df[toronto_df.Borough != 'Not assigned'].reset_index(drop = True)
toronto_df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


Checking whether any borough is left without a neighborhood

In [18]:
#rows that still carry a placeholder instead of a neighborhood name
toronto_df.Neighborhood[toronto_df.Neighborhood.isin(['', 'na', 'Not assigned'])] #empty result: every assigned borough has a neighborhood

Series([], Name: Neighborhood, dtype: object)

### Checking the dimensions of the df - Final step

In [19]:
toronto_df.shape

(103, 3)

# _<center> End of Webscraping (Part-1) </center>_

***

# <center> Adding Coordinates of Neighborhoods (Part-2) </center>

##### Since the geocoder package is taking too long, we will use the CSV file provided instead

In [20]:
latlong_df = pd.read_csv('http://cocl.us/Geospatial_data')

In [21]:
latlong_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Checking whether any postal code is repeated in `toronto_df`

In [22]:
toronto_df.PostalCode.nunique() == toronto_df.shape[0] #True means there are no recurring postal codes

True

Checking whether any postal code is repeated in `latlong_df`

In [23]:
latlong_df['Postal Code'].nunique() == latlong_df.shape[0] #True means there are no recurring postal codes

True
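To see what a repeated postal code would look like, here is a minimal sketch on a toy dataframe. The `toy` frame and its values are made up for illustration; `Series.is_unique` and `Series.duplicated` are standard pandas calls.

```python
import pandas as pd

# Toy data (assumption): 'M1B' is deliberately repeated
toy = pd.DataFrame({'PostalCode': ['M1B', 'M1C', 'M1B']})

# is_unique is False as soon as any value occurs more than once
has_duplicates = not toy['PostalCode'].is_unique

# duplicated(keep=False) flags every occurrence of a repeated value
repeated = toy.loc[toy['PostalCode'].duplicated(keep=False), 'PostalCode'].unique().tolist()

print(has_duplicates, repeated)
```

Listing the repeated values, rather than only returning a boolean, makes it easier to decide how to deduplicate.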

Does the `latlong_df` dataframe have all the postal codes in our `toronto_df` dataframe?

In [24]:
set(toronto_df.PostalCode) == set(latlong_df['Postal Code'])

True

The comparison above shows that `latlong_df` contains exactly the postal codes we need. The number of rows in each dataframe is shown below.

In [25]:
toronto_df.shape[0], latlong_df.shape[0]

(103, 103)

In [26]:
final_df = pd.merge(toronto_df,latlong_df.rename(columns = {'Postal Code': 'PostalCode'}))
final_df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
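The merge can also verify its own assumptions. The sketch below, on toy frames that are assumptions for illustration, uses the `validate` and `indicator` parameters of `DataFrame.merge` to enforce a one-to-one key match and to list postal codes that found no coordinates.

```python
import pandas as pd

# Toy frames (assumption): 'M5A' has no matching coordinates on purpose
left = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M5A'],
                     'Borough': ['North York', 'North York', 'Downtown Toronto']})
right = pd.DataFrame({'Postal Code': ['M3A', 'M4A', 'M1B'],
                      'Latitude': [43.753259, 43.725882, 43.806686],
                      'Longitude': [-79.329656, -79.315572, -79.194353]})

checked = left.merge(
    right.rename(columns={'Postal Code': 'PostalCode'}),
    on='PostalCode',
    how='left',               # keep every row of the left frame
    validate='one_to_one',    # raise MergeError if a key repeats on either side
    indicator=True,           # add a '_merge' column describing each row's origin
)

# rows present only in the left frame did not receive coordinates
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

With a left join, unmatched rows stay visible as `NaN` coordinates instead of silently disappearing, as they would with the default inner join.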


# <center> End of Part-2 </center>

# <center> Clustering Toronto's Neighborhoods (Part-3) </center>

In [27]:
is_toronto_mask = [True if bor.find('Toronto') != -1 else False for bor in final_df.Borough]

In [28]:
toronto_data = final_df[is_toronto_mask].reset_index(drop = True)
toronto_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031


Let's see how many neighborhoods Toronto has.

In [29]:
toronto_data.Neighborhood.nunique() #Toronto has 39 neighborhoods

39

Although the code above reports 39 neighborhoods, the actual number is higher, because many `PostalCode` entries list more than one neighborhood, which adds to the complexity of the project. There are two options:
1. __Method - 1__ : keep only the first neighborhood listed for each `PostalCode` and continue the analysis with it, __so we will be dealing with 39 neighborhoods only__.
2. __Method - 2__ : split the neighborhood entries that hold multiple values into one row per neighborhood and replicate the other column values. This adds to the number of rows, so __the dataframe captures the actual, larger number of neighborhoods__. However, since latitude and longitude are available only per `PostalCode`, all neighborhoods sharing a postal code would receive the same coordinates, so the additional neighborhoods can be dropped.
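For comparison, Method - 2 could be sketched as follows. This is only an illustration on a toy frame (the `toy` data is an assumption), and it relies on `DataFrame.explode`, which requires pandas 0.25 or later.

```python
import pandas as pd

# Toy frame (assumption) mirroring the columns of toronto_data
toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M5C'],
    'Neighborhood': ['Regent Park, Harbourfront', 'St. James Town'],
    'Latitude': [43.65426, 43.651494],
    'Longitude': [-79.360636, -79.375418],
})

# split each entry into a list, then give every list element its own row;
# the other columns are replicated automatically
exploded = (toy.assign(Neighborhood=toy['Neighborhood'].str.split(','))
               .explode('Neighborhood')
               .reset_index(drop=True))

# remove the space left over after each comma
exploded['Neighborhood'] = exploded['Neighborhood'].str.strip()
print(exploded)
```

Both rows for `M5A` end up with identical coordinates, which is why the extra rows add little for a location-based venue search.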

### This project implements _Method - 1_ : Using only the first neighborhood available in the `toronto_data` dataframe

In [30]:
toronto_full = toronto_data.copy()

In [31]:
first_neighbor = [neighbor.split(',')[0] for neighbor in toronto_full.Neighborhood]
first_neighbor[:10]

['Regent Park',
 "Queen's Park",
 'Garden District',
 'St. James Town',
 'The Beaches',
 'Berczy Park',
 'Central Bay Street',
 'Christie',
 'Richmond',
 'Dufferin']

In [32]:
toronto_full.Neighborhood = first_neighbor
toronto_full.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,Regent Park,43.65426,-79.360636
1,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
2,M5B,Downtown Toronto,Garden District,43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031
5,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
6,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
7,M6G,Downtown Toronto,Christie,43.669542,-79.422564
8,M5H,Downtown Toronto,Richmond,43.650571,-79.384568
9,M6H,West Toronto,Dufferin,43.669005,-79.442259


#### Visualizing Toronto neighborhoods

In [33]:
toronto_full.dtypes

PostalCode       object
Borough          object
Neighborhood     object
Latitude        float64
Longitude       float64
dtype: object

In [34]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [35]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(toronto_full['Latitude'], toronto_full['Longitude'], toronto_full['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### Accessing location data using Foursquare

In [36]:
CLIENT_ID = 'QIMCJRTDSNBAB2LVLS0EDC5SXUA5ROYZKZZBNRES5Q44YCIM' # your Foursquare ID
CLIENT_SECRET = 'S0OJCSLL5S4EL1PJFY3VJV0DG334CULSZ0KZG4NA00C5ZZFG' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

In [37]:
def getNearbyVenues(neighbor_name, latitude, longitude, radius = 500, limit = 100):
    
    venues_list = []
    for name, lat, lng in zip(neighbor_name, latitude, longitude):
        #make url
        fs_url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            limit
        )
        
        #get result
        fs_result = requests.get(fs_url).json()['response']['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in fs_result])
        
    #build the dataframe once, after all neighborhoods have been queried
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

In [38]:
toronto_venues = getNearbyVenues(toronto_full.Neighborhood, 
                                 toronto_full.Latitude,
                                 toronto_full.Longitude
                                )
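The function above indexes straight into the JSON response, so a missing `groups` list or a venue without categories would raise an exception. Below is a minimal, hedged sketch of a more tolerant extraction step. The `extract_venues` helper and the `sample_payload` dictionary are hypothetical; the payload only imitates the nesting used above (`response`, `groups`, `items`, `venue`) and is not a real API response.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore-style payload into tuples, tolerating missing keys."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or []
        # fall back to None when a venue has no category at all
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue.get('name'),
                     location.get('lat'), location.get('lng'), category))
    return rows

# Hypothetical payload: the second venue is invented and has no category
sample_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Roselle Desserts',
               'location': {'lat': 43.653447, 'lng': -79.362017},
               'categories': [{'name': 'Bakery'}]}},
    {'venue': {'name': 'Uncategorized Venue',
               'location': {'lat': 43.65, 'lng': -79.36},
               'categories': []}},
]}]}}

rows = extract_venues(sample_payload, 'Regent Park', 43.65426, -79.360636)
print(rows)
```

Keeping the parsing separate from the HTTP request also makes it possible to test the parsing without network access.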

Let's look at the venues found near each Toronto neighborhood

In [39]:
toronto_venues.head(10)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Regent Park,43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,Regent Park,43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,Regent Park,43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
3,Regent Park,43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
4,Regent Park,43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
5,Regent Park,43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant
6,Regent Park,43.65426,-79.360636,Corktown Common,43.655618,-79.356211,Park
7,Regent Park,43.65426,-79.360636,The Extension Room,43.653313,-79.359725,Gym / Fitness Center
8,Regent Park,43.65426,-79.360636,The Distillery Historic District,43.650244,-79.359323,Historic Site
9,Regent Park,43.65426,-79.360636,Figs Breakfast & Lunch,43.655675,-79.364503,Breakfast Spot


##  Analyze neighborhoods

In [40]:
toronto_venues.shape[0]

1615

So we have __1615__ venues of interest across all neighborhoods of Toronto

How many unique categories are available?

In [41]:
toronto_venues['Venue Category'].nunique()

236

We have __236__ unique categories available

#### Representing each neighborhood as a vector of venue categories

In [42]:
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix= '', prefix_sep= '')
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood']
toronto_onehot.head()

Unnamed: 0,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Art Gallery,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Rearranging the column order so that `Neighborhood` comes first

In [43]:
total_columns = toronto_onehot.columns.tolist()
total_columns.remove('Neighborhood')

In [44]:
column_order = ['Neighborhood'] + total_columns
column_order[:4]

['Neighborhood', 'Afghan Restaurant', 'Airport', 'Airport Food Court']

In [45]:
toronto_onehot = toronto_onehot[column_order]
toronto_onehot.head()

Unnamed: 0,Neighborhood,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Regent Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Regent Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Regent Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Regent Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Regent Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [46]:
toronto_grouped = toronto_onehot.groupby('Neighborhood', as_index= False).mean()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.018182,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Brockton,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Business reply mail Processing Centre,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824
3,CN Tower,0.0,0.058824,0.058824,0.117647,0.176471,0.117647,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.015873,0.0,0.0,0.015873,0.0,0.0,0.0,0.015873
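Why the mean of one-hot columns gives a frequency can be seen on a small example. The `toy_venues` frame below is an assumption made up for illustration: neighborhood `A` has three venues, two of which are coffee shops, so its `Coffee Shop` value becomes 2/3.

```python
import pandas as pd

# Toy venue list (assumption): three venues in A, one in B
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Park', 'Park'],
})

# one indicator column per category; cast so the mean is numeric on any pandas version
onehot = pd.get_dummies(toy_venues['Venue Category']).astype(float)
onehot.insert(0, 'Neighborhood', toy_venues['Neighborhood'])

# the mean of 0/1 indicators is the share of venues in each category
freq = onehot.groupby('Neighborhood').mean()
print(freq)
```

Each row of `freq` sums to 1, so neighborhoods with very different venue counts remain comparable.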


Function that returns the top 'n' venue categories for a neighborhood

In [47]:
# a = toronto_onehot.iloc[1,:] #dataframe notation
# a.iloc[1:]
# # a

In [48]:
def return_top_venues(neighborhood, num_top_venues):
    #get venue frequencies only (skip the leading 'Neighborhood' label)
    venue_list = neighborhood.iloc[1:]
    #sort a copy rather than sorting the slice in place
    venue_list = venue_list.sort_values(ascending = False)
    return venue_list.index.values[:num_top_venues]

In [49]:
num_top_venues = 10

extensions = ['st','nd','rd']
colnames = []
for colnum in range(num_top_venues):
    try:
        colnames.append('{}{} Most common venue'.format(colnum+1, extensions[colnum]))
    except IndexError: #ranks beyond the 3rd take the 'th' suffix
        colnames.append('{}th Most common venue'.format(colnum+1))

colnames = ['Neighborhood'] + colnames
toronto_sorted = pd.DataFrame(columns= colnames)
toronto_sorted['Neighborhood'] = toronto_grouped['Neighborhood']
for row_num in range(toronto_grouped.shape[0]):
    first_row = toronto_grouped.iloc[row_num, :]
    toronto_sorted.iloc[row_num, 1:] = return_top_venues(first_row, num_top_venues)

toronto_sorted.head()

Unnamed: 0,Neighborhood,1st Most common venue,2nd Most common venue,3rd Most common venue,4th Most common venue,5th Most common venue,6th Most common venue,7th Most common venue,8th Most common venue,9th Most common venue,10th Most common venue
0,Berczy Park,Coffee Shop,Cocktail Bar,Bakery,Cheese Shop,Café,Restaurant,Beer Bar,Seafood Restaurant,Greek Restaurant,Gourmet Shop
1,Brockton,Café,Coffee Shop,Breakfast Spot,Stadium,Bar,Italian Restaurant,Bakery,Restaurant,Climbing Gym,Furniture / Home Store
2,Business reply mail Processing Centre,Yoga Studio,Auto Workshop,Skate Park,Light Rail Station,Smoke Shop,Spa,Farmers Market,Fast Food Restaurant,Burrito Place,Restaurant
3,CN Tower,Airport Service,Airport Lounge,Airport Terminal,Harbor / Marina,Bar,Plane,Coffee Shop,Rental Car Location,Sculpture Garden,Boat or Ferry
4,Central Bay Street,Coffee Shop,Italian Restaurant,Japanese Restaurant,Sandwich Place,Café,Department Store,Salad Place,Bubble Tea Shop,Burger Joint,Korean Restaurant
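A more compact variant of the ranking step is sketched below, assuming the frequencies are held in a numeric Series (a row of `toronto_grouped` would first need its `Neighborhood` label dropped and the remaining values cast to float). The helper names `top_categories` and `ordinal` are hypothetical.

```python
import pandas as pd

def top_categories(freq_row, n):
    """Return the n category names with the highest frequency."""
    return freq_row.nlargest(n).index.tolist()

def ordinal(k):
    """Format 1 -> '1st', 2 -> '2nd', 11 -> '11th', 21 -> '21st'."""
    if 10 <= k % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(k % 10, 'th')
    return '{}{}'.format(k, suffix)

# Toy frequencies (assumption), already numeric
toy_row = pd.Series({'Coffee Shop': 0.5, 'Park': 0.3, 'Bakery': 0.2})
top_two = top_categories(toy_row, 2)
headers = ['{} Most common venue'.format(ordinal(k)) for k in range(1, 4)]
print(top_two, headers)
```

Unlike a fixed list of three suffixes, `ordinal` stays correct if `num_top_venues` is later raised above 20.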


In [50]:
kclusters = 5
toronto_clustering = toronto_grouped.drop('Neighborhood', axis = 1)
kmeans = KMeans(n_clusters= kclusters, random_state = 10).fit(toronto_clustering)

In [51]:
kmeans.labels_[:10]

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
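The first ten labels all fall into the same cluster, so it is worth checking how the clusters are sized and whether the chosen number of clusters is reasonable. The sketch below shows one common way to do that, using `silhouette_score` on toy data. The toy points are assumptions, and the notebook itself does not perform this check.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data (assumption): two tight, well-separated groups of three points
toy = np.array([[0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
                [1.0, 0.9], [0.9, 1.0], [1.0, 1.0]])

# score each candidate number of clusters; higher silhouette is better
scores = {}
for k in (2, 3):
    labels = KMeans(n_clusters=k, n_init=10, random_state=10).fit_predict(toy)
    scores[k] = silhouette_score(toy, labels)

best_k = max(scores, key=scores.get)

# cluster sizes for the preferred k
best_labels = KMeans(n_clusters=best_k, n_init=10, random_state=10).fit_predict(toy)
sizes = np.bincount(best_labels)
print(best_k, sizes)
```

On the real data, `np.bincount(kmeans.labels_)` would show at a glance whether one cluster absorbs most neighborhoods.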

#### Adding cluster labels to Toronto dataset 

In [52]:
toronto_grouped.head()

Unnamed: 0,Neighborhood,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.018182,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Brockton,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Business reply mail Processing Centre,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824
3,CN Tower,0.0,0.058824,0.058824,0.117647,0.176471,0.117647,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.015873,0.0,0.0,0.015873,0.0,0.0,0.0,0.015873


In [53]:
toronto_sorted.insert(0,'Cluster_labels',kmeans.labels_)

In [54]:
toronto_sorted.head()

Unnamed: 0,Cluster_labels,Neighborhood,1st Most common venue,2nd Most common venue,3rd Most common venue,4th Most common venue,5th Most common venue,6th Most common venue,7th Most common venue,8th Most common venue,9th Most common venue,10th Most common venue
0,1,Berczy Park,Coffee Shop,Cocktail Bar,Bakery,Cheese Shop,Café,Restaurant,Beer Bar,Seafood Restaurant,Greek Restaurant,Gourmet Shop
1,1,Brockton,Café,Coffee Shop,Breakfast Spot,Stadium,Bar,Italian Restaurant,Bakery,Restaurant,Climbing Gym,Furniture / Home Store
2,1,Business reply mail Processing Centre,Yoga Studio,Auto Workshop,Skate Park,Light Rail Station,Smoke Shop,Spa,Farmers Market,Fast Food Restaurant,Burrito Place,Restaurant
3,1,CN Tower,Airport Service,Airport Lounge,Airport Terminal,Harbor / Marina,Bar,Plane,Coffee Shop,Rental Car Location,Sculpture Garden,Boat or Ferry
4,1,Central Bay Street,Coffee Shop,Italian Restaurant,Japanese Restaurant,Sandwich Place,Café,Department Store,Salad Place,Bubble Tea Shop,Burger Joint,Korean Restaurant


In [55]:
toronto_merged = toronto_full.join(toronto_sorted.set_index('Neighborhood'), on = 'Neighborhood')
toronto_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster_labels,1st Most common venue,2nd Most common venue,3rd Most common venue,4th Most common venue,5th Most common venue,6th Most common venue,7th Most common venue,8th Most common venue,9th Most common venue,10th Most common venue
0,M5A,Downtown Toronto,Regent Park,43.65426,-79.360636,1,Coffee Shop,Bakery,Pub,Park,Theater,Breakfast Spot,Restaurant,Café,Spa,Shoe Store
1,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494,1,Coffee Shop,Sushi Restaurant,Yoga Studio,Bank,Beer Bar,Smoothie Shop,Sandwich Place,Burger Joint,Burrito Place,Café
2,M5B,Downtown Toronto,Garden District,43.657162,-79.378937,1,Clothing Store,Coffee Shop,Italian Restaurant,Café,Bubble Tea Shop,Japanese Restaurant,Middle Eastern Restaurant,Cosmetics Shop,Theater,Hotel
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,1,Coffee Shop,Café,Restaurant,Italian Restaurant,Cocktail Bar,Gastropub,American Restaurant,Bakery,Pharmacy,Park
4,M4E,East Toronto,The Beaches,43.676357,-79.293031,1,Trail,Health Food Store,Pub,Yoga Studio,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop


####  Visualizing clusters

In [56]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
# markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster_labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Naming the clusters

In [86]:
#returns the most frequent '1st' and the most frequent '2nd' most common venue within a cluster
def get_top2_spot(cluster):
    spot = cluster['1st Most common venue'].value_counts().index.values[0] + ' and ' + cluster['2nd Most common venue'].value_counts().index.values[0]
    return spot

##### Cluster 1 

In [87]:
c1 = toronto_merged.loc[toronto_merged['Cluster_labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
get_top2_spot(c1)

'Jewelry Store and Trail'

Thus Cluster 1 can be labelled as the __ornament cluster or luxury item cluster__.

##### Cluster 2 

In [88]:
c2 = toronto_merged.loc[toronto_merged['Cluster_labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
get_top2_spot(c2)

'Coffee Shop and Café'

Thus Cluster 2 can be labelled as the __refreshment or recreation cluster__

##### Cluster 3

In [91]:
c3 = toronto_merged.loc[toronto_merged['Cluster_labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
get_top2_spot(c3)

'Garden and Home Service'

Cluster 3 can be labelled as __Home decor cluster__

##### Cluster 4

In [92]:
c4 = toronto_merged.loc[toronto_merged['Cluster_labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
get_top2_spot(c4)

'Park and Restaurant'

Cluster 4 can be labelled as __Meet-up cluster__

##### Cluster 5

In [93]:
c5 = toronto_merged.loc[toronto_merged['Cluster_labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
get_top2_spot(c5)

'Park and Bus Line'

Cluster 5 can be labelled as the __Convention cluster__
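The five nearly identical cells above could also be collapsed into one loop over the cluster labels. The sketch below shows the idea on a toy frame; `toy_merged` and `most_frequent` are assumptions for illustration, with column names matching those of `toronto_merged`.

```python
import pandas as pd

# Toy stand-in (assumption) for toronto_merged
toy_merged = pd.DataFrame({
    'Cluster_labels': [0, 1, 1, 1],
    '1st Most common venue': ['Trail', 'Coffee Shop', 'Coffee Shop', 'Café'],
    '2nd Most common venue': ['Park', 'Café', 'Café', 'Bakery'],
})

def most_frequent(series):
    """Return the value that occurs most often in the series."""
    return series.value_counts().index[0]

# one summary string per cluster label
summary = {}
for label, group in toy_merged.groupby('Cluster_labels'):
    summary[int(label)] = (most_frequent(group['1st Most common venue'])
                           + ' and '
                           + most_frequent(group['2nd Most common venue']))
print(summary)
```

Note that the labels start at 0, so label 0 corresponds to what the text calls Cluster 1.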

## Conclusion 

Thus, 5 different clusters in Toronto have been identified.

1. Ornament Cluster
2. Refreshment Cluster
3. Home decor Cluster
4. Meet-up Cluster
5. Convention Cluster

# <center> End of Part-3 </center>