## IBM Data Science Course Capstone Project




by A.Rakhmetov






### Assignment Week 1

In [1]:
import pandas as pd
import numpy as np
print('Hello Capstone Project Course!')

Hello Capstone Project Course!


### Assignment Week 3. Segmenting and Clustering Neighborhoods in Toronto

## 1. Download and Explore Dataset

#### Parsing the Toronto postal code data from the Wikipedia page:

In [2]:
from bs4 import BeautifulSoup
import requests


wiki_page = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(wiki_page,'html.parser')

#The html table with postal codes:
My_table = soup.find('table', class_='wikitable sortable')


#### Preparing the 2D NumPy array for the dataframe operations in Pandas:

In [3]:

List=[]
for cell in My_table.find_all('td'):
    List.append(cell.text.replace('\n',''))
   
data = np.reshape(np.array(List), (-1,3))

print(data.shape)
data[0:5]

(289, 3)


array([['M1A', 'Not assigned', 'Not assigned'],
       ['M2A', 'Not assigned', 'Not assigned'],
       ['M3A', 'North York', 'Parkwoods'],
       ['M4A', 'North York', 'Victoria Village'],
       ['M5A', 'Downtown Toronto', 'Harbourfront']], dtype='<U49')
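
The reshape step relies on every table row contributing exactly three `td` cells. A minimal sketch on a made-up list (the values are illustrative, not scraped) shows how `-1` lets NumPy infer the row count, and how a cell count that is not a multiple of three fails loudly:

```python
import numpy as np

# Hypothetical flat list of cell texts, three cells per table row.
cells = ['M1A', 'Not assigned', 'Not assigned',
         'M3A', 'North York', 'Parkwoods']

# -1 asks NumPy to infer the number of rows from the list length.
rows = np.reshape(np.array(cells), (-1, 3))
print(rows.shape)

# With one cell missing the reshape raises ValueError, which is a useful
# early warning that the page layout may have changed.
try:
    np.reshape(np.array(cells[:-1]), (-1, 3))
    reshape_failed = False
except ValueError:
    reshape_failed = True
print(reshape_failed)
```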

#### Implementing and pre-processing the dataframe according to the requirements:

In [4]:
# Creating the dataframe out of the numpy array:
df = pd.DataFrame({'PostalCode':data[:,0], 'Borough':data[:,1], 'Neighborhood':data[:,2]})

#Getting rid of "Not assigned" cells in the 'Borough' and 'Neighborhood' columns:
df = df[df.Borough != "Not assigned"]
df['Neighborhood'] = np.where(df['Neighborhood']=='Not assigned', df['Borough'], df['Neighborhood'])

# Grouping the Neighborhoods in lists according to the postal codes:
df = df.groupby(['PostalCode','Borough'])['Neighborhood'].apply(list).reset_index()

# Merging the lists into one string separated with commas:
df['Neighborhood'] = df['Neighborhood'].str.join(', ')


df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [5]:
df.shape

(103, 3)
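
The three cleaning rules above (drop unassigned boroughs, fall back to the borough name, join neighborhoods per postal code) can be checked on a toy frame. This is only a sketch: the rows below are illustrative and `toy` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy rows covering the three cases handled in the cell above.
toy = pd.DataFrame({
    'PostalCode':   ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough':      ['Not assigned', 'Downtown Toronto',
                     'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront',
                     'Regent Park', 'Not assigned'],
})

# 1. Drop postal codes without a borough.
toy = toy[toy['Borough'] != 'Not assigned'].copy()

# 2. A missing neighborhood falls back to the borough name.
toy['Neighborhood'] = np.where(toy['Neighborhood'] == 'Not assigned',
                               toy['Borough'], toy['Neighborhood'])

# 3. One row per postal code, neighborhoods joined with commas.
toy = (toy.groupby(['PostalCode', 'Borough'])['Neighborhood']
          .agg(', '.join)
          .reset_index())
print(toy)
```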

#### Getting the geospatial coordinates of each Postal Code:

In [6]:
# Installing the geocoder:
!conda install -c conda-forge geocoder --yes

Solving environment: done

# All requested packages already installed.



In [None]:
import geocoder 

g = geocoder.google("M5G, Toronto, Ontario")
print(g.latlng)


Unfortunately, after numerous failed attempts to obtain geospatial coordinates from the Geocoder API, I had to fall back on the prepared CSV file of coordinates.
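
For a geocoder that fails intermittently, a bounded retry loop is one option to try before giving up. The sketch below is an assumption, not what this notebook ran: `geocode_with_retry` and `flaky_lookup` are hypothetical names, the coordinates are placeholders, and `lookup` stands in for any real geocoding call.

```python
import time

def geocode_with_retry(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    `lookup` is any callable returning a (lat, lng) pair or None.
    """
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay_seconds)
    return None

# Fake lookup that fails twice before succeeding, to exercise the loop.
calls = {'count': 0}

def flaky_lookup(query):
    calls['count'] += 1
    return None if calls['count'] < 3 else (43.65, -79.38)  # placeholder values

result = geocode_with_retry(flaky_lookup, 'M5G, Toronto, Ontario')
print(result, calls['count'])
```

Capping the attempts keeps the notebook from hanging when the service never answers; a `None` result then signals that a fallback such as the CSV file is needed.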

In [None]:
gdf=pd.read_csv('http://cocl.us/geospatial_data')
gdf.head()

In [None]:
# Merging the dataframes in one:

toronto_df = pd.merge(df, gdf, left_on='PostalCode', right_on='Postal Code').drop(columns='Postal Code')

toronto_df.head()

In [None]:
toronto_df.shape
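
The merge above is an inner join, so a postal code missing from the coordinates file would silently disappear. A hedged sketch with made-up frames (`codes`, `coords`, and all values are illustrative) shows how a left join with `indicator=True` makes such gaps visible:

```python
import pandas as pd

# Illustrative frames; the coordinates are placeholders, not real data.
codes = pd.DataFrame({'PostalCode': ['M1B', 'M1C', 'M9Z'],
                      'Borough': ['Scarborough', 'Scarborough', 'Nowhere']})
coords = pd.DataFrame({'Postal Code': ['M1B', 'M1C'],
                       'Latitude': [1.0, 2.0],
                       'Longitude': [-1.0, -2.0]})

# A left join keeps every postal code; _merge records where each row matched.
merged = pd.merge(codes, coords, how='left',
                  left_on='PostalCode', right_on='Postal Code',
                  indicator=True).drop(columns='Postal Code')

unmatched = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```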

## 2. Explore Neighborhoods in the City of Toronto

#### Creating the map of Toronto City.

First, we import all the libraries we need:

In [11]:
import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed
import folium # map rendering library

print('Libraries imported.')

Solving environment: done

# All requested packages already installed.

Solving environment: done

# All requested packages already installed.

Libraries imported.


In [13]:
# Using geopy library to get coordinates of Toronto City:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent='toronto_explorer')  # recent geopy versions require a user_agent; this name is arbitrary
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto City are 43.653963, -79.387207.


In [14]:
# Creating the map of Toronto using coordinates:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)


# adding markers of postal codes to the map:
for lat, lng, borough, postalcode in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Borough'], toronto_df['PostalCode']):
    label = '{}, {}'.format(postalcode, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [15]:
boroughs = toronto_df.groupby('Borough')['PostalCode'].count()
boroughs

Borough
Central Toronto      9
Downtown Toronto    18
East Toronto         5
East York            5
Etobicoke           12
Mississauga          1
North York          24
Queen's Park         1
Scarborough         17
West Toronto         6
York                 5
Name: PostalCode, dtype: int64

## 3. Analyze Each Neighborhood

To analyze only the city of Toronto, we consider the boroughs with the word 'Toronto' in their name:

In [16]:
tdf = toronto_df[toronto_df['Borough'].str.contains("Toronto")].reset_index()
print(tdf.shape)
tdf.head()

(38, 6)


Unnamed: 0,index,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,37,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,42,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
3,43,M4M,East Toronto,Studio District,43.659526,-79.340923
4,44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


Next, we use the Foursquare API to explore the neighborhoods and segment them.

#### Define Foursquare Credentials and Version

In [17]:
CLIENT_ID = 'UW4A0SIAAZUPMAV2H0ZWELOGEHQXO0BZFGIDE1CRGW0I3LN2' # your Foursquare ID
CLIENT_SECRET = 'API1GRDKJAIUTSTKFHUSTV0RYDHKTCLS3V5IOVQ2NCIHFUB3' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: UW4A0SIAAZUPMAV2H0ZWELOGEHQXO0BZFGIDE1CRGW0I3LN2
CLIENT_SECRET:API1GRDKJAIUTSTKFHUSTV0RYDHKTCLS3V5IOVQ2NCIHFUB3


#### Let's create a function to repeat the same process for all the neighborhoods in Toronto

In [18]:
def getNearbyVenues(names, latitudes, longitudes, LIMIT=100, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
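
The list comprehension in `getNearbyVenues` raises if a response lacks `groups` or a venue has an empty `categories` list. The sketch below shows one defensive way to flatten a payload with the same nesting. `extract_venues` and `fake_payload` are hypothetical; the payload is hand-written for illustration and is not a real API response.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one venues/explore-style payload into row tuples.

    Assumes the nesting used in getNearbyVenues; missing keys yield no rows
    and an empty category list yields a None category instead of raising.
    """
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or []
        category = categories[0].get('name') if categories else None
        rows.append((name, lat, lng, venue.get('name'),
                     location.get('lat'), location.get('lng'), category))
    return rows

# Hand-written payload with one complete and one uncategorized venue.
fake_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Some Place',
               'location': {'lat': 43.0, 'lng': -79.0},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Uncategorized Place',
               'location': {'lat': 43.1, 'lng': -79.1},
               'categories': []}},
]}]}}

rows = extract_venues(fake_payload, 'Toy Hood', 43.05, -79.05)
print(rows)
```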

#### Now run the above function on each neighborhood and create a new dataframe called *toronto_venues*.

In [20]:
toronto_venues = getNearbyVenues(names=tdf['Neighborhood'],
                                 latitudes=tdf['Latitude'],
                                 longitudes=tdf['Longitude'])





The Beaches
The Danforth West, Riverdale
The Beaches West, India Bazaar
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park, Summerhill East
Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West
Rosedale
Cabbagetown, St. James Town
Church and Wellesley
Harbourfront, Regent Park
Ryerson, Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide, King, Richmond
Harbourfront East, Toronto Islands, Union Station
Design Exchange, Toronto Dominion Centre
Commerce Court, Victoria Hotel
Roselawn
Forest Hill North, Forest Hill West
The Annex, North Midtown, Yorkville
Harbord, University of Toronto
Chinatown, Grange Park, Kensington Market
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place, Underground city
Christie
Dovercourt Village, Dufferin
Little Portugal, Trinity
Brockton, Exhibition Place, Parkdale Village
High Park, The 

#### Let's check the size of the resulting dataframe

In [21]:
print(toronto_venues.shape)
toronto_venues.head()

(1701, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
1,The Beaches,43.676357,-79.293031,Starbucks,43.678798,-79.298045,Coffee Shop
2,The Beaches,43.676357,-79.293031,Glen Stewart Park,43.675278,-79.294647,Park
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,"The Danforth West, Riverdale",43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant


Let's check how many venues were returned for each neighborhood

In [22]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide, King, Richmond",100,100,100,100,100,100
Berczy Park,56,56,56,56,56,56
"Brockton, Exhibition Place, Parkdale Village",18,18,18,18,18,18
Business Reply Mail Processing Centre 969 Eastern,19,19,19,19,19,19
"CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara",12,12,12,12,12,12
"Cabbagetown, St. James Town",43,43,43,43,43,43
Central Bay Street,81,81,81,81,81,81
"Chinatown, Grange Park, Kensington Market",100,100,100,100,100,100
Christie,15,15,15,15,15,15
Church and Wellesley,87,87,87,87,87,87


#### Let's find out how many unique categories can be curated from all the returned venues

In [23]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 236 unique categories.


In [24]:
# one hot encoding
e = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# adding the column 'Neighborhood' as the first one:
g = pd.DataFrame(toronto_venues['Neighborhood'])
toronto_onehot = pd.merge(g, e, right_index=True, left_index=True).drop(columns='Neighborhood_y').rename(columns={'Neighborhood_x': 'Neighborhood'})
print(toronto_onehot.shape)
toronto_onehot.head()

(1701, 236)


Unnamed: 0,Neighborhood,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"The Danforth West, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [25]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,...,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.01,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Brockton, Exhibition Place, Parkdale Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.052632
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.0,0.0,0.083333,0.083333,0.166667,0.166667,0.166667,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Cabbagetown, St. James Town",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.023256,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012346,0.0,...,0.0,0.0,0.0,0.012346,0.0,0.0,0.012346,0.0,0.0,0.012346
7,"Chinatown, Grange Park, Kensington Market",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.05,0.0,0.04,0.01,0.0,0.0,0.0
8,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Church and Wellesley,0.011494,0.011494,0.0,0.0,0.0,0.0,0.0,0.011494,0.0,...,0.0,0.0,0.0,0.0,0.011494,0.011494,0.0,0.011494,0.0,0.011494
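
Why the mean of one-hot columns equals a category frequency is easiest to see on a toy example. In this sketch `venues` and its values are illustrative, and `dtype=int` is passed so the indicator columns are numeric regardless of the pandas version:

```python
import pandas as pd

# Toy venue list (illustrative names and categories).
venues = pd.DataFrame({
    'Neighborhood':   ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Park', 'Pub', 'Park', 'Cafe', 'Pub', 'Pub'],
})

# One indicator column per category.
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of a 0/1 column is the share of venues in that category,
# so each row of freq sums to 1.
freq = onehot.groupby('Neighborhood').mean()
print(freq)
```

Note that in the real data one venue category is itself called `Neighborhood` (see the "Upper Beaches" row earlier), which is why the notebook's merge produces `Neighborhood_x` and `Neighborhood_y` columns.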


#### Let's print each neighborhood along with the top 5 most common venues

In [26]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide, King, Richmond----
                 venue  freq
0          Coffee Shop  0.06
1                 Café  0.05
2           Steakhouse  0.04
3  American Restaurant  0.04
4      Thai Restaurant  0.04


----Berczy Park----
            venue  freq
0     Coffee Shop  0.07
1      Restaurant  0.05
2    Cocktail Bar  0.05
3  Farmers Market  0.04
4     Cheese Shop  0.04


----Brockton, Exhibition Place, Parkdale Village----
                    venue  freq
0                    Café  0.11
1          Breakfast Spot  0.11
2             Coffee Shop  0.11
3  Furniture / Home Store  0.06
4                 Stadium  0.06


----Business Reply Mail Processing Centre 969 Eastern----
                venue  freq
0  Light Rail Station  0.11
1          Restaurant  0.05
2          Smoke Shop  0.05
3         Pizza Place  0.05
4       Burrito Place  0.05


----CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara----
              venue  freq
0    Airp

#### Let's put that into a *pandas* dataframe

First, let's write a function to sort the venues in descending order.

In [27]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
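
A quick check of the same idea on a single hypothetical row is sketched below. It uses `Series.nlargest` instead of `sort_values`, which is a swapped-in alternative and not the function above; with tied frequencies `nlargest` keeps the first occurrence, so the result is reproducible.

```python
import pandas as pd

# Hypothetical frequency row shaped like one row of toronto_grouped.
row = pd.Series({'Neighborhood': 'Toy Hood',
                 'Park': 0.50, 'Pub': 0.30, 'Cafe': 0.15, 'Gym': 0.05})

# Skip the name, coerce to float, and pick the three largest frequencies.
top3 = row.iloc[1:].astype(float).nlargest(3).index.tolist()
print(top3)
```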

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [28]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Café,American Restaurant,Steakhouse,Thai Restaurant,Bakery,Bar,Restaurant,Clothing Store,Hotel
1,Berczy Park,Coffee Shop,Restaurant,Cocktail Bar,Bakery,Steakhouse,Farmers Market,Cheese Shop,Seafood Restaurant,Italian Restaurant,Beer Bar
2,"Brockton, Exhibition Place, Parkdale Village",Breakfast Spot,Café,Coffee Shop,Pet Store,Caribbean Restaurant,Furniture / Home Store,Climbing Gym,Italian Restaurant,Stadium,Bar
3,Business Reply Mail Processing Centre 969 Eastern,Light Rail Station,Yoga Studio,Recording Studio,Smoke Shop,Skate Park,Brewery,Burrito Place,Butcher,Restaurant,Comic Shop
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",Airport Service,Airport Terminal,Airport Lounge,Plane,Sculpture Garden,Harbor / Marina,Boat or Ferry,Airport Food Court,Airport,Creperie
5,"Cabbagetown, St. James Town",Coffee Shop,Restaurant,Pizza Place,Bakery,Café,Italian Restaurant,Pub,Jewelry Store,Bank,Japanese Restaurant
6,Central Bay Street,Coffee Shop,Café,Italian Restaurant,Bar,Burger Joint,Chinese Restaurant,Falafel Restaurant,Spa,Sandwich Place,Salad Place
7,"Chinatown, Grange Park, Kensington Market",Café,Bar,Vegetarian / Vegan Restaurant,Bakery,Dumpling Restaurant,Vietnamese Restaurant,Coffee Shop,Mexican Restaurant,Chinese Restaurant,Dessert Shop
8,Christie,Café,Grocery Store,Park,Convenience Store,Coffee Shop,Baby Store,Restaurant,Italian Restaurant,Nightclub,Diner
9,Church and Wellesley,Japanese Restaurant,Sushi Restaurant,Coffee Shop,Gay Bar,Restaurant,Burger Joint,Café,Gastropub,Fast Food Restaurant,Mediterranean Restaurant
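
The `indicators` list with a fallback works for ten columns but would produce labels such as "21th" for larger counts. A small helper is one way to generalize it; `ordinal` is a hypothetical name and not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# Same column names as in the cell above, built without try/except.
columns = ['Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns)
```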


## 4. Cluster Neighborhoods

Run *k*-means to cluster the neighborhoods into 5 clusters.

In [29]:
# import k-means from clustering stage:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_ 

array([0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0,
       0, 0, 2, 0, 0, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
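
Before interpreting the clusters it helps to see how unevenly the labels are distributed. The sketch below transcribes the label array printed above (written in chunks to keep it readable) and counts the members of each cluster with `np.bincount`:

```python
import numpy as np

# Labels transcribed from the kmeans.labels_ output above.
labels = np.array(
    [0, 0, 0, 0, 4] + [0] * 12 + [3] + [0] * 4 +  # first printed row
    [0, 0, 2, 0, 0, 3, 1] + [0] * 9)              # second printed row

# Count how many neighborhoods fall into each of the 5 clusters.
sizes = np.bincount(labels, minlength=5)
for cluster, size in enumerate(sizes):
    print('Cluster {}: {} neighborhoods'.format(cluster, size))
```

Almost every neighborhood lands in cluster 0, so the remaining clusters mostly isolate one or two unusual neighborhoods each.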

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [30]:
# add clustering labels
neighborhoods_venues_sorted.insert(0,'Cluster Labels', kmeans.labels_)
neighborhoods_venues_sorted

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,0,"Adelaide, King, Richmond",Coffee Shop,Café,American Restaurant,Steakhouse,Thai Restaurant,Bakery,Bar,Restaurant,Clothing Store,Hotel
1,0,Berczy Park,Coffee Shop,Restaurant,Cocktail Bar,Bakery,Steakhouse,Farmers Market,Cheese Shop,Seafood Restaurant,Italian Restaurant,Beer Bar
2,0,"Brockton, Exhibition Place, Parkdale Village",Breakfast Spot,Café,Coffee Shop,Pet Store,Caribbean Restaurant,Furniture / Home Store,Climbing Gym,Italian Restaurant,Stadium,Bar
3,0,Business Reply Mail Processing Centre 969 Eastern,Light Rail Station,Yoga Studio,Recording Studio,Smoke Shop,Skate Park,Brewery,Burrito Place,Butcher,Restaurant,Comic Shop
4,4,"CN Tower, Bathurst Quay, Island airport, Harbo...",Airport Service,Airport Terminal,Airport Lounge,Plane,Sculpture Garden,Harbor / Marina,Boat or Ferry,Airport Food Court,Airport,Creperie
5,0,"Cabbagetown, St. James Town",Coffee Shop,Restaurant,Pizza Place,Bakery,Café,Italian Restaurant,Pub,Jewelry Store,Bank,Japanese Restaurant
6,0,Central Bay Street,Coffee Shop,Café,Italian Restaurant,Bar,Burger Joint,Chinese Restaurant,Falafel Restaurant,Spa,Sandwich Place,Salad Place
7,0,"Chinatown, Grange Park, Kensington Market",Café,Bar,Vegetarian / Vegan Restaurant,Bakery,Dumpling Restaurant,Vietnamese Restaurant,Coffee Shop,Mexican Restaurant,Chinese Restaurant,Dessert Shop
8,0,Christie,Café,Grocery Store,Park,Convenience Store,Coffee Shop,Baby Store,Restaurant,Italian Restaurant,Nightclub,Diner
9,0,Church and Wellesley,Japanese Restaurant,Sushi Restaurant,Coffee Shop,Gay Bar,Restaurant,Burger Joint,Café,Gastropub,Fast Food Restaurant,Mediterranean Restaurant


In [36]:
# merge neighborhoods_venues_sorted with tdf to add latitude/longitude for each neighborhood
toronto_merged = pd.merge(tdf, neighborhoods_venues_sorted, on='Neighborhood').drop(columns='index')

toronto_merged # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Coffee Shop,Park,Pub,Yoga Studio,Dog Run,Filipino Restaurant,Fast Food Restaurant,Farmers Market,Falafel Restaurant,Event Space
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,0,Greek Restaurant,Coffee Shop,Ice Cream Shop,Bookstore,Italian Restaurant,Pizza Place,Diner,Dessert Shop,Pub,Caribbean Restaurant
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572,0,Park,Sandwich Place,Burger Joint,Steakhouse,Food & Drink Shop,Fish & Chips Shop,Liquor Store,Fast Food Restaurant,Brewery,Burrito Place
3,M4M,East Toronto,Studio District,43.659526,-79.340923,0,Café,Coffee Shop,American Restaurant,Italian Restaurant,Bakery,Fish Market,Bookstore,Latin American Restaurant,Brewery,Seafood Restaurant
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,0,Park,Dim Sum Restaurant,Swim School,Bus Line,Yoga Studio,Doner Restaurant,Filipino Restaurant,Fast Food Restaurant,Farmers Market,Falafel Restaurant
5,M4P,Central Toronto,Davisville North,43.712751,-79.390197,0,Restaurant,Breakfast Spot,Gym,Grocery Store,Park,Hotel,Sandwich Place,Burger Joint,Food & Drink Shop,Clothing Store
6,M4R,Central Toronto,North Toronto West,43.715383,-79.405678,0,Sporting Goods Shop,Coffee Shop,Yoga Studio,Clothing Store,Chinese Restaurant,Dessert Shop,Diner,Rental Car Location,Salon / Barbershop,Sandwich Place
7,M4S,Central Toronto,Davisville,43.704324,-79.38879,0,Dessert Shop,Pizza Place,Sandwich Place,Pharmacy,Café,Italian Restaurant,Seafood Restaurant,Coffee Shop,Sushi Restaurant,Toy / Game Store
8,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316,2,Park,Yoga Studio,Dog Run,Fish & Chips Shop,Filipino Restaurant,Fast Food Restaurant,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant
9,M4V,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",43.686412,-79.400049,0,Coffee Shop,Pub,Pizza Place,Sushi Restaurant,Convenience Store,Sports Bar,Bagel Shop,Fried Chicken Joint,Supermarket,Light Rail Station


Finally, let's visualize the resulting clusters

In [34]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
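
The color scheme above only needs one distinct hex color per cluster. As an alternative sketch that avoids matplotlib, evenly spaced hues from the standard-library `colorsys` module give the same kind of palette. `cluster_palette` is a hypothetical helper, and the saturation and brightness values are arbitrary choices.

```python
import colorsys

def cluster_palette(n_clusters):
    """Return n_clusters evenly spaced hues as '#rrggbb' strings."""
    palette = []
    for i in range(n_clusters):
        r, g, b = colorsys.hsv_to_rgb(i / n_clusters, 0.8, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            int(round(r * 255)), int(round(g * 255)), int(round(b * 255))))
    return palette

palette = cluster_palette(5)

# Index by the label itself so that cluster 0 maps to palette[0].
color_for = {label: palette[label] for label in range(5)}
print(color_for)
```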

## 5. Examine Clusters

Based on the defining categories, we can assign a name to each cluster.

#### Cluster 0. Dining areas

In [37]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,The Beaches,0,Coffee Shop,Park,Pub,Yoga Studio,Dog Run,Filipino Restaurant,Fast Food Restaurant,Farmers Market,Falafel Restaurant,Event Space
1,"The Danforth West, Riverdale",0,Greek Restaurant,Coffee Shop,Ice Cream Shop,Bookstore,Italian Restaurant,Pizza Place,Diner,Dessert Shop,Pub,Caribbean Restaurant
2,"The Beaches West, India Bazaar",0,Park,Sandwich Place,Burger Joint,Steakhouse,Food & Drink Shop,Fish & Chips Shop,Liquor Store,Fast Food Restaurant,Brewery,Burrito Place
3,Studio District,0,Café,Coffee Shop,American Restaurant,Italian Restaurant,Bakery,Fish Market,Bookstore,Latin American Restaurant,Brewery,Seafood Restaurant
4,Lawrence Park,0,Park,Dim Sum Restaurant,Swim School,Bus Line,Yoga Studio,Doner Restaurant,Filipino Restaurant,Fast Food Restaurant,Farmers Market,Falafel Restaurant
5,Davisville North,0,Restaurant,Breakfast Spot,Gym,Grocery Store,Park,Hotel,Sandwich Place,Burger Joint,Food & Drink Shop,Clothing Store
6,North Toronto West,0,Sporting Goods Shop,Coffee Shop,Yoga Studio,Clothing Store,Chinese Restaurant,Dessert Shop,Diner,Rental Car Location,Salon / Barbershop,Sandwich Place
7,Davisville,0,Dessert Shop,Pizza Place,Sandwich Place,Pharmacy,Café,Italian Restaurant,Seafood Restaurant,Coffee Shop,Sushi Restaurant,Toy / Game Store
9,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",0,Coffee Shop,Pub,Pizza Place,Sushi Restaurant,Convenience Store,Sports Bar,Bagel Shop,Fried Chicken Joint,Supermarket,Light Rail Station
11,"Cabbagetown, St. James Town",0,Coffee Shop,Restaurant,Pizza Place,Bakery,Café,Italian Restaurant,Pub,Jewelry Store,Bank,Japanese Restaurant


#### Cluster 1. Family area

In [38]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
22,Roselawn,1,Music Venue,Home Service,Pool,Garden,Yoga Studio,Filipino Restaurant,Fast Food Restaurant,Farmers Market,Falafel Restaurant,Event Space


#### Cluster 2. Park areas

In [39]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,"Moore Park, Summerhill East",2,Park,Yoga Studio,Dog Run,Fish & Chips Shop,Filipino Restaurant,Fast Food Restaurant,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant


#### Cluster 3. Trail area

In [40]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
10,Rosedale,3,Park,Playground,Trail,Yoga Studio,Dog Run,Filipino Restaurant,Fast Food Restaurant,Farmers Market,Falafel Restaurant,Event Space
23,"Forest Hill North, Forest Hill West",3,Park,Trail,Sushi Restaurant,Jewelry Store,Yoga Studio,Doner Restaurant,Filipino Restaurant,Fast Food Restaurant,Farmers Market,Falafel Restaurant


#### Cluster 4. Airport area

In [41]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
27,"CN Tower, Bathurst Quay, Island airport, Harbo...",4,Airport Service,Airport Terminal,Airport Lounge,Plane,Sculpture Garden,Harbor / Marina,Boat or Ferry,Airport Food Court,Airport,Creperie
