## Scraping a Wikipedia page to get the neighborhoods of Toronto and segmenting them into clusters

__1. First step__: install the required packages and import the necessary libraries.    
Web scraping is done with the BeautifulSoup library, using the lxml parser.

In [1]:
!pip install beautifulsoup4

Requirement not upgraded as not directly required: beautifulsoup4 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages


In [2]:
!pip install lxml

Requirement not upgraded as not directly required: lxml in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages


In [3]:
import lxml
import pandas as pd
import requests
from bs4 import BeautifulSoup

The page's HTML is downloaded with requests and parsed with BeautifulSoup. Uncomment the prettify line to print it indented.

In [4]:
website_link = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
source = requests.get(website_link).text

soup = BeautifulSoup(source, 'lxml')
#print(soup.prettify())

In [5]:
#Let's see if we got the right article:
title = soup.title.text
print(title.split(' - ')[0])

List of postal codes of Canada: M


__Next__, let's get the information from the table in the Wikipedia article.

In [6]:
#get the info from the table
table = soup.find('table', class_='wikitable sortable')
#print(table.prettify())

__Next__, let's collect the cells of each table row in a list, which we then transform into a dataframe with the right column names.

In the same step, rows that don't have an assigned borough are removed, as are rows whose borough is 0.

In [7]:
info_list = []
for row in table.find_all('tr'):
    data=row.find_all('td')
    info_list.append([i.text.strip() for i in data])
    
column_names = ['PostalCode', 'Borough', 'Neighborhood']
neighborhoods_df = pd.DataFrame(columns= column_names,data=info_list[1:])

neighborhoods_df=neighborhoods_df[neighborhoods_df.Borough != 'Not assigned']
neighborhoods_df=neighborhoods_df[neighborhoods_df.Borough != 0]
neighborhoods_df.reset_index(inplace=True, drop=True)

neighborhoods_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


Let's look at some information about the current data:

In [8]:
print('There are:')
print('  {} Postal codes'.format(neighborhoods_df['PostalCode'].unique().shape[0]))
print('  {} Boroughs'.format(neighborhoods_df['Borough'].unique().shape[0]))
print('  {} Neighborhoods'.format(neighborhoods_df['Neighborhood'].unique().shape[0]))

There are:
  103 Postal codes
  11 Boroughs
  209 Neighborhoods


However, some neighborhoods still have a 'Not assigned' value. For those rows, we use the borough name as the neighborhood.

In [9]:
neighborhoods_df.loc[neighborhoods_df.Neighborhood == 'Not assigned', 'Neighborhood'] = neighborhoods_df.Borough
neighborhoods_df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern
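The assignment above works because pandas aligns the right-hand Series with the selected rows by index. A minimal sketch of the same pattern on a made-up three-row frame (the rows are illustrative and are not taken from the Wikipedia table):

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M2A', 'M3A'],
    'Borough': ["Queen's Park", 'North York', 'Scarborough'],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

mask = toy['Neighborhood'] == 'Not assigned'
# Only the masked rows receive the borough name; pandas aligns on the index
toy.loc[mask, 'Neighborhood'] = toy.loc[mask, 'Borough']

filled = toy['Neighborhood'].tolist()
print(filled)
```

Restricting the right-hand side to the same mask is optional here, but it makes the intent explicit.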


Since there are more neighborhoods than postal codes, we combine the rows that share a postal code into one, listing the neighborhoods separated by commas in the last column.

In [10]:
neighborhoods_df=neighborhoods_df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()
neighborhoods_df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
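To see what the groupby step does in isolation, here is a small sketch on three rows copied from the table above (the variable names toy and combined are only for this illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge', 'Malvern', 'Woburn'],
})

# One row per (PostalCode, Borough); neighborhoods are joined in their original order
combined = (
    toy.groupby(['PostalCode', 'Borough'])['Neighborhood']
       .apply(', '.join)
       .reset_index()
)
print(combined)
```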


Let's see how many rows the dataframe has now. The number should match the count of unique postal codes found earlier (103).

In [11]:
neighborhoods_df.shape

(103, 3)

Now that the neighborhoods are in a proper dataframe keyed by postal code, it's time to gather their coordinates so that we can segment them into clusters.

The coordinates are read from the following source: http://cocl.us/Geospatial_data and loaded directly into a pandas dataframe. 

In [12]:
location_data=pd.read_csv('http://cocl.us/Geospatial_data')
location_data.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [13]:
#rename the postal code column to match with the other dataframe
location_data.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)

__Next__, we merge both dataframes on the postal code.

In [14]:
neigh_location = pd.merge(neighborhoods_df, location_data, on='PostalCode')
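A default merge is an inner join, so postal codes missing from either side are dropped silently. The following sketch, which is not part of the original analysis, shows one way to check for that with the how, validate and indicator parameters of pd.merge. The postal code 'M9Z' and the borough 'Example' are made up to force a mismatch.

```python
import pandas as pd

left = pd.DataFrame({'PostalCode': ['M1B', 'M1C', 'M9Z'],
                     'Borough': ['Scarborough', 'Scarborough', 'Example']})
right = pd.DataFrame({'PostalCode': ['M1B', 'M1C'],
                      'Latitude': [43.806686, 43.784535],
                      'Longitude': [-79.194353, -79.160497]})

# how='left' keeps every neighborhood row; validate raises if a key is duplicated;
# indicator adds a '_merge' column telling where each row came from
checked = pd.merge(left, right, on='PostalCode', how='left',
                   validate='one_to_one', indicator=True)

unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

An empty unmatched list would mean that every postal code received coordinates.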

In [15]:
#let's see what the new dataframe looks like
neigh_location.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848


To do the clustering analysis and display the results, we first need to import some libraries.

In [17]:
from geopy.geocoders import Nominatim

!conda install -c conda-forge folium=0.5.0 --yes
import folium

import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following NEW packages will be INSTALLED:

    altair:  2.2.2-py35_1 conda-forge
    branca:  0.3.1-py_0   conda-forge
    folium:  0.5.0-py_0   conda-forge
    vincent: 0.4.4-py_1   conda-forge

altair-2.2.2-p 100% |################################| Time: 0:00:00  19.77 MB/s
branca-0.3.1-p 100% |################################| Time: 0:00:00  17.73 MB/s
vincent-0.4.4- 100% |################################| Time: 0:00:00  19.38 MB/s
folium-0.5.0-p 100% |################################| Time: 0:00:00  23.61 MB/s


Now, using geopy, we'll determine the coordinates of Toronto

In [18]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="tn_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


__Next__, we will generate a map of Toronto with its neighborhoods superimposed.

In [22]:
%matplotlib inline
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neigh_location['Latitude'], neigh_location['Longitude'], neigh_location['Borough'],neigh_location['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Next, we are going to use the Foursquare API to explore the neighborhoods and segment them. We'll limit the results to 30 venues per request.

In [28]:
CLIENT_ID = 'FE5Z4BE34BWUOWLLBGWVPOFGNN2SHHF13LNDJLKQT3AGKRMS'
CLIENT_SECRET ='KSSNOXFWS32B0A40M2O1J4N0TTDWAQXSVGZLHD43DZGOKQSG'
VERSION = '20190401'
LIMIT=30

print('My credentials:')
print('ID:', CLIENT_ID)
print('Secret:', CLIENT_SECRET)

My credentials:
ID: FE5Z4BE34BWUOWLLBGWVPOFGNN2SHHF13LNDJLKQT3AGKRMS
Secret: KSSNOXFWS32B0A40M2O1J4N0TTDWAQXSVGZLHD43DZGOKQSG
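Hard-coding and printing API credentials makes them visible to anyone who reads the notebook. A possible alternative, sketched here under the assumption that the credentials are stored in environment variables, is shown below. The variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET and both helper functions are hypothetical.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping such as os.environ.

    The two variable names are hypothetical; use whatever names you set yourself.
    """
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('Missing environment variable: {}'.format(missing))

def mask(secret, visible=4):
    """Show only the first few characters when printing a secret."""
    return secret[:visible] + '*' * max(len(secret) - visible, 0)

# Demo with a fake mapping instead of the real environment
demo_id, demo_secret = load_foursquare_credentials(
    {'FOURSQUARE_CLIENT_ID': 'demo-id', 'FOURSQUARE_CLIENT_SECRET': 'demo-secret'})
print('ID:', mask(demo_id))
print('Secret:', mask(demo_secret))
```

In the notebook itself, calling load_foursquare_credentials() without arguments would read from the real environment.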


For illustration purposes, let's simplify the above map and segment and cluster only the neighborhoods in boroughs that have 'Toronto' in their name. So let's slice the original dataframe and create a new dataframe from that subset.

In [26]:
toronto_neigh = neigh_location[neigh_location['Borough'].str.contains('Toronto')].reset_index(drop=True)
toronto_neigh.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
5,M4P,Central Toronto,Davisville North,43.712751,-79.390197
6,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
7,M4S,Central Toronto,Davisville,43.704324,-79.38879
8,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
9,M4V,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",43.686412,-79.400049


The following is a function that finds the nearby venues in those neighborhoods.

In [29]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
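The function above indexes straight into the JSON response, so a missing key or a venue without categories raises an exception. A more defensive variant of the parsing step could look like the sketch below. The helper extract_venues is hypothetical, and the sample payload is hand-written to mirror only the keys the function above accesses; it is not real API output.

```python
def extract_venues(payload):
    """Pull (name, lat, lng, category) tuples out of an explore-style payload.

    Returns an empty list when the expected keys are absent.
    """
    groups = payload.get('response', {}).get('groups', [])
    if not groups:
        return []
    rows = []
    for item in groups[0].get('items', []):
        venue = item.get('venue', {})
        location = venue.get('location', {})
        # fall back to a placeholder when a venue has no category
        categories = venue.get('categories') or [{}]
        rows.append((
            venue.get('name'),
            location.get('lat'),
            location.get('lng'),
            categories[0].get('name', 'Unknown'),
        ))
    return rows

# Hand-written payload for illustration
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': 43.65, 'lng': -79.38},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'No Category Place',
               'location': {'lat': 43.66, 'lng': -79.39},
               'categories': []}},
]}]}}

venues = extract_venues(sample)
empty = extract_venues({'response': {}})
print(venues)
```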

So now let's use the function to find the venues near the selected neighborhoods.

In [30]:
toronto_venues = getNearbyVenues(names=toronto_neigh['Neighborhood'],
                                   latitudes=toronto_neigh['Latitude'],
                                   longitudes=toronto_neigh['Longitude']
                                  )

The Beaches
The Danforth West, Riverdale
The Beaches West, India Bazaar
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park, Summerhill East
Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West
Rosedale
Cabbagetown, St. James Town
Church and Wellesley
Harbourfront, Regent Park
Ryerson, Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide, King, Richmond
Harbourfront East, Toronto Islands, Union Station
Design Exchange, Toronto Dominion Centre
Commerce Court, Victoria Hotel
Roselawn
Forest Hill North, Forest Hill West
The Annex, North Midtown, Yorkville
Harbord, University of Toronto
Chinatown, Grange Park, Kensington Market
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place, Underground city
Christie
Dovercourt Village, Dufferin
Little Portugal, Trinity
Brockton, Exhibition Place, Parkdale Village
High Park, The 

In [31]:
print(toronto_venues.shape)
toronto_venues.head(10)

(837, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
1,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
2,The Beaches,43.676357,-79.293031,Starbucks,43.678798,-79.298045,Coffee Shop
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,"The Danforth West, Riverdale",43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant
5,"The Danforth West, Riverdale",43.679557,-79.352188,Dolce Gelato,43.677773,-79.351187,Ice Cream Shop
6,"The Danforth West, Riverdale",43.679557,-79.352188,MenEssentials,43.67782,-79.351265,Cosmetics Shop
7,"The Danforth West, Riverdale",43.679557,-79.352188,Mezes,43.677962,-79.350196,Greek Restaurant
8,"The Danforth West, Riverdale",43.679557,-79.352188,Messini Authentic Gyros,43.677827,-79.350569,Greek Restaurant
9,"The Danforth West, Riverdale",43.679557,-79.352188,Cafe Fiorentina,43.677743,-79.350115,Italian Restaurant


Above we see that 837 venues were found in our search. Let's check how many there are per neighborhood. Don't forget that a row can contain several neighborhoods, since they are grouped by postal code.

In [32]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide, King, Richmond",30,30,30,30,30,30
Berczy Park,30,30,30,30,30,30
"Brockton, Exhibition Place, Parkdale Village",22,22,22,22,22,22
Business Reply Mail Processing Centre 969 Eastern,19,19,19,19,19,19
"CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara",14,14,14,14,14,14
"Cabbagetown, St. James Town",30,30,30,30,30,30
Central Bay Street,30,30,30,30,30,30
"Chinatown, Grange Park, Kensington Market",30,30,30,30,30,30
Christie,16,16,16,16,16,16
Church and Wellesley,30,30,30,30,30,30


In [34]:
#to have some general info
print('There are {} types of venues in total in the selected neighborhoods.'.format(toronto_venues['Venue Category'].unique().shape[0]))

There are 191 types of venues in total in the selected neighborhoods.


Now we're gonna look at the venues in each neighborhood or group of neighborhoods. To do that, each venue category is first one-hot encoded. 

In [39]:
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Yoga Studio,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Thai Restaurant,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [41]:
toronto_onehot.shape

(837, 191)

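The dataframe has 191 columns: 190 venue-category columns plus the neighborhood names. Foursquare has a venue category that is itself called 'Neighborhood' (see 'Upper Beaches' above), so adding the neighborhood names in the one-hot step overwrote that dummy column instead of appending a new one. That is why there are 191 columns rather than 192, and why 'Yoga Studio' (the last column) was moved to the front instead of 'Neighborhood'.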
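When a venue category has the same name as the column being added (here the category 'Neighborhood'), the assignment silently replaces the dummy column. One way to avoid that is sketched below on toy rows, assuming it is acceptable to rename the dummy column. The new name 'Neighborhood (venue category)' is an arbitrary choice.

```python
import pandas as pd

# Toy rows; the category 'Neighborhood' reproduces the name clash
toy_venues = pd.DataFrame({
    'Neighborhood': ['The Beaches', 'The Beaches', 'Christie'],
    'Venue Category': ['Pub', 'Neighborhood', 'Pub'],
})

onehot = pd.get_dummies(toy_venues[['Venue Category']], prefix='', prefix_sep='')
# Rename the clashing dummy column so it is not overwritten below
onehot = onehot.rename(columns={'Neighborhood': 'Neighborhood (venue category)'})
# insert() places the neighborhood names first, so no column reordering is needed
onehot.insert(0, 'Neighborhood', toy_venues['Neighborhood'])

columns = onehot.columns.tolist()
print(columns)
```

Applying this to the real data would change the shapes and column order reported in this notebook, so the outputs shown here would no longer match.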

In [43]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Yoga Studio,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Thai Restaurant,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,...,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.033333,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.0,0.0
2,"Brockton, Exhibition Place, Parkdale Village",0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Business Reply Mail Processing Centre 969 Eastern,0.052632,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.0,0.0,0.071429,0.071429,0.071429,0.142857,0.142857,0.142857,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [44]:
toronto_grouped.shape

(38, 191)

Now we're gonna look at the top 5 venue types in each neighborhood. 

In [45]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide, King, Richmond----
              venue  freq
0        Steakhouse  0.10
1             Hotel  0.07
2              Café  0.07
3  Asian Restaurant  0.07
4       Pizza Place  0.03


----Berczy Park----
                venue  freq
0  Seafood Restaurant  0.07
1        Cocktail Bar  0.07
2              Bakery  0.07
3                Café  0.07
4      Farmers Market  0.07


----Brockton, Exhibition Place, Parkdale Village----
            venue  freq
0  Breakfast Spot  0.09
1     Coffee Shop  0.09
2            Café  0.09
3   Burrito Place  0.05
4         Stadium  0.05


----Business Reply Mail Processing Centre 969 Eastern----
                venue  freq
0         Yoga Studio  0.05
1  Light Rail Station  0.05
2             Butcher  0.05
3          Comic Shop  0.05
4      Farmers Market  0.05


----CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara----
              venue  freq
0  Airport Terminal  0.14
1    Airport Lounge  0.14

Let's put that into a pandas dataframe. First, let's write a function to sort the venues in descending order.

In [46]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [49]:
#this is the new dataframe with the top 10 venues
import numpy as np
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Steakhouse,Asian Restaurant,Café,Hotel,Gastropub,Noodle House,Monument / Landmark,Opera House,Concert Hall,Pizza Place
1,Berczy Park,Farmers Market,Cocktail Bar,Seafood Restaurant,Bakery,Café,Breakfast Spot,Restaurant,Pub,Bistro,Belgian Restaurant
2,"Brockton, Exhibition Place, Parkdale Village",Breakfast Spot,Café,Coffee Shop,Gym,Furniture / Home Store,Pet Store,Performing Arts Venue,Nightclub,Italian Restaurant,Grocery Store
3,Business Reply Mail Processing Centre 969 Eastern,Yoga Studio,Auto Workshop,Light Rail Station,Restaurant,Skate Park,Brewery,Smoke Shop,Recording Studio,Spa,Farmers Market
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",Airport Service,Airport Terminal,Airport Lounge,Boutique,Sculpture Garden,Plane,Harbor / Marina,Boat or Ferry,Airport Gate,Airport Food Court


Finally, we're gonna cluster the neighborhoods into 5 clusters according to the venues nearby using k-means.

In [51]:
k_clusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1)

kmeans = KMeans(n_clusters=k_clusters, random_state=0).fit(toronto_grouped_clustering)

kmeans.labels_[0:10]

array([0, 0, 3, 3, 3, 0, 3, 0, 0, 3], dtype=int32)
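For intuition about what the labels mean, here is a bare-bones NumPy sketch of the k-means idea: assign each row to its nearest centroid, then move each centroid to the mean of its rows. It is not scikit-learn's implementation. The function tiny_kmeans, the two-column profiles and the starting centroids are all made up for illustration.

```python
import numpy as np

def tiny_kmeans(points, centroids, n_iter=10):
    """Plain Lloyd iterations: nearest-centroid assignment, then mean update."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float).copy()
    for _ in range(n_iter):
        # distance of every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(len(centroids)):
            members = points[labels == k]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return labels, centroids

# Two made-up frequency profiles: cafe-heavy rows versus park-heavy rows
profiles = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels, centers = tiny_kmeans(profiles, centroids=[[1.0, 0.0], [0.0, 1.0]])
print(labels)
```

Rows with similar venue frequencies end up with the same label, which is what the cluster numbers in this notebook express.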

Now, we're gonna add the cluster number to the dataframe.

In [52]:
neighborhoods_venues_sorted.insert(0, 'Cluster label', kmeans.labels_)

And join it with the Toronto dataframe that holds the coordinates.

In [53]:
toronto_merged = toronto_neigh.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
toronto_merged.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster label,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,3,Coffee Shop,Health Food Store,Pub,Deli / Bodega,Farmers Market,Falafel Restaurant,Ethiopian Restaurant,Eastern European Restaurant,Dumpling Restaurant,Dog Run
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,3,Greek Restaurant,Ice Cream Shop,Italian Restaurant,Yoga Studio,Brewery,Restaurant,Bubble Tea Shop,Juice Bar,Spa,Pub
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572,3,Park,Gym,Sushi Restaurant,Pet Store,Movie Theater,Pub,Burrito Place,Burger Joint,Brewery,Sandwich Place
3,M4M,East Toronto,Studio District,43.659526,-79.340923,0,Café,Coffee Shop,American Restaurant,Bakery,Italian Restaurant,Comfort Food Restaurant,Ice Cream Shop,Sandwich Place,Bookstore,Seafood Restaurant
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,4,Bus Line,Park,Swim School,Dance Studio,Falafel Restaurant,Ethiopian Restaurant,Eastern European Restaurant,Dumpling Restaurant,Dog Run,Discount Store


And now, let's visualize the clusters.

In [64]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(k_clusters)
ys = [i + x + (i*x)**2 for i in range(k_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster label']):
    label = folium.Popup(str(poi) + ', Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

After seeing the clusters on the map, we might as well check the venues in each cluster and see how they've been grouped.

#### Cluster 1:

In [59]:
toronto_merged.loc[toronto_merged['Cluster label'] == 0, toronto_merged.columns[[2] + list(range(6, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,Studio District,Café,Coffee Shop,American Restaurant,Bakery,Italian Restaurant,Comfort Food Restaurant,Ice Cream Shop,Sandwich Place,Bookstore,Seafood Restaurant
11,"Cabbagetown, St. James Town",Restaurant,Coffee Shop,Italian Restaurant,Café,Park,Market,Diner,Pub,Jewelry Store,Beer Store
14,"Ryerson, Garden District",Café,Coffee Shop,Plaza,Movie Theater,Beer Bar,Sandwich Place,Diner,Ramen Restaurant,Burger Joint,Burrito Place
15,St. James Town,Coffee Shop,Gastropub,Restaurant,Italian Restaurant,Japanese Restaurant,Hotel,BBQ Joint,Performing Arts Venue,Church,Cosmetics Shop
16,Berczy Park,Farmers Market,Cocktail Bar,Seafood Restaurant,Bakery,Café,Breakfast Spot,Restaurant,Pub,Bistro,Belgian Restaurant
18,"Adelaide, King, Richmond",Steakhouse,Asian Restaurant,Café,Hotel,Gastropub,Noodle House,Monument / Landmark,Opera House,Concert Hall,Pizza Place
20,"Design Exchange, Toronto Dominion Centre",Coffee Shop,Café,Deli / Bodega,Gym,Concert Hall,Beer Bar,Restaurant,Pub,Pizza Place,Museum
21,"Commerce Court, Victoria Hotel",Café,Restaurant,Coffee Shop,Hotel,Gastropub,Deli / Bodega,Art Gallery,Seafood Restaurant,Beer Bar,Pub
25,"Harbord, University of Toronto",Café,Restaurant,Bookstore,Bar,Bakery,Japanese Restaurant,Gym,Italian Restaurant,Noodle House,College Arts Building
26,"Chinatown, Grange Park, Kensington Market",Café,Vietnamese Restaurant,Caribbean Restaurant,Mexican Restaurant,Bakery,Wine Bar,Gourmet Shop,Beer Bar,Belgian Restaurant,Cheese Shop


#### Cluster 2:

In [60]:
toronto_merged.loc[toronto_merged['Cluster label'] == 1, toronto_merged.columns[[2] + list(range(6, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,"Moore Park, Summerhill East",Playground,Trail,Tennis Court,Wine Bar,Cuban Restaurant,Ethiopian Restaurant,Eastern European Restaurant,Dumpling Restaurant,Dog Run,Discount Store


#### Cluster 3:

In [61]:
toronto_merged.loc[toronto_merged['Cluster label'] == 2, toronto_merged.columns[[2] + list(range(6, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
22,Roselawn,Garden,Wine Bar,Deli / Bodega,Farmers Market,Falafel Restaurant,Ethiopian Restaurant,Eastern European Restaurant,Dumpling Restaurant,Dog Run,Discount Store


#### Cluster 4:

In [62]:
toronto_merged.loc[toronto_merged['Cluster label'] == 3, toronto_merged.columns[[2] + list(range(6, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,The Beaches,Coffee Shop,Health Food Store,Pub,Deli / Bodega,Farmers Market,Falafel Restaurant,Ethiopian Restaurant,Eastern European Restaurant,Dumpling Restaurant,Dog Run
1,"The Danforth West, Riverdale",Greek Restaurant,Ice Cream Shop,Italian Restaurant,Yoga Studio,Brewery,Restaurant,Bubble Tea Shop,Juice Bar,Spa,Pub
2,"The Beaches West, India Bazaar",Park,Gym,Sushi Restaurant,Pet Store,Movie Theater,Pub,Burrito Place,Burger Joint,Brewery,Sandwich Place
5,Davisville North,Hotel,Gym,Breakfast Spot,Food & Drink Shop,Park,Clothing Store,Burger Joint,Sandwich Place,Grocery Store,American Restaurant
6,North Toronto West,Coffee Shop,Sporting Goods Shop,Yoga Studio,Italian Restaurant,Park,Pet Store,Chinese Restaurant,Mexican Restaurant,Rental Car Location,Dessert Shop
7,Davisville,Dessert Shop,Sandwich Place,Coffee Shop,Sushi Restaurant,Pizza Place,Café,Italian Restaurant,Park,Pharmacy,Brewery
9,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",Coffee Shop,Pub,American Restaurant,Fried Chicken Joint,Supermarket,Vietnamese Restaurant,Sushi Restaurant,Pizza Place,Medical Center,Convenience Store
12,Church and Wellesley,Burger Joint,Gay Bar,Park,Bookstore,Breakfast Spot,Restaurant,Bubble Tea Shop,Ramen Restaurant,Pub,Pizza Place
13,"Harbourfront, Regent Park",Coffee Shop,Bakery,Park,Mexican Restaurant,Café,Pub,Breakfast Spot,Performing Arts Venue,Restaurant,Theater
17,Central Bay Street,Coffee Shop,Spa,Italian Restaurant,Bubble Tea Shop,Portuguese Restaurant,Bar,Seafood Restaurant,Sandwich Place,Ramen Restaurant,Park


#### Cluster 5:

In [63]:
toronto_merged.loc[toronto_merged['Cluster label'] == 4, toronto_merged.columns[[2] + list(range(6, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Lawrence Park,Bus Line,Park,Swim School,Dance Studio,Falafel Restaurant,Ethiopian Restaurant,Eastern European Restaurant,Dumpling Restaurant,Dog Run,Discount Store
10,Rosedale,Park,Playground,Trail,Wine Bar,Cuban Restaurant,Ethiopian Restaurant,Eastern European Restaurant,Dumpling Restaurant,Dog Run,Discount Store
23,"Forest Hill North, Forest Hill West",Park,Trail,Jewelry Store,Sushi Restaurant,Deli / Bodega,Falafel Restaurant,Ethiopian Restaurant,Eastern European Restaurant,Dumpling Restaurant,Dog Run


And with that, we have clustered the neighborhoods of Toronto that are in boroughs with "Toronto" in their names. We grouped them into 5 clusters and visualized the result on a map. 

### Thank you for going through this notebook!