## Capstone project week 3

In this project I explore segmentation and clustering of the neighborhoods in the city of Toronto.
Data about Toronto's boroughs and neighborhoods are taken from Wikipedia. Features describing the neighborhoods are then retrieved from Foursquare. Finally, segmentation and clustering are applied to these data.
The notebook is divided into 3 parts:

1) Data are retrieved from Wikipedia and cleaned

2) Data are completed with information about latitude and longitude

3) Clustering and segmentation of these data

## Part 1
In this part, data about Toronto neighborhoods are taken from https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=945633050
and reformatted into a pandas dataframe.

In [1]:
import requests                           # send HTTP requests to a website
from bs4 import BeautifulSoup             # parse the HTML of a web page
import pandas as pd                       # manipulate tabular data
from pandas import json_normalize         # transform JSON into a pandas dataframe (pandas >= 1.0)
import folium                             # map rendering library
from geopy.geocoders import Nominatim     # convert an address into latitude and longitude values
import numpy as np
from sklearn.cluster import KMeans        # import k-means from clustering stage
import matplotlib.cm as cm                # Matplotlib and associated plotting modules
import matplotlib.colors as colors

In [2]:
# The page is downloaded with requests and parsed with the BeautifulSoup library
website_url = requests.get("https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=945633050").text
soup = BeautifulSoup(website_url,"html.parser")
# We look for a table in the page
My_table = soup.find('table',{'class':'wikitable sortable'})
#My_table # to see the html table

Let's clean our data

In [3]:
# create a string vector with all data
resto = My_table.find_all('td')


lista_text = []
for elements in resto:
    lista_text = lista_text + [elements.string]
 

# we now retrieve: Postcodes, Boroughs, Neighbourhoods (the cells come in groups of three)
Postcodes = []
Boroughs = []
Neighbourhoods = []
st='\n'
l=int(len(lista_text)/3)


for i in range(0,l):
    Postcode = str(lista_text[i*3])
    Borough = str(lista_text[1+i*3])
    Neighbourhood = str(lista_text[2+i*3])

    Postcodes.append(Postcode)
    Boroughs.append(Borough)
    Neighbourhoods.append(Neighbourhood)
  
# Dataframe creation
df = {'Postcode': Postcodes, 'Borough': Boroughs,'Neighbourhood': Neighbourhoods} 
df = pd.DataFrame(df) 
df = df.replace('\n',' ', regex=True)


# Dataframe cleaning

# Get indexes for which Borough is not assigned
indexNotAss = df[df['Borough'] == 'Not assigned'].index
# Delete these row indexes from dataFrame
df.drop(indexNotAss , inplace=True)
# Get indexes for which Neighbourhood is not assigned (str(None) gives the string 'None')
indexNeigh = df[df['Neighbourhood'] == 'None'].index
# Replace not assigned Neighbourhood with Borough; .loc avoids chained assignment
df.loc[indexNeigh, 'Neighbourhood'] = df.loc[indexNeigh, 'Borough']
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,North York
3,M4A,North York,North York
4,M5A,Downtown Toronto,Downtown Toronto
5,M6A,North York,North York
6,M6A,North York,North York
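The two cleaning rules above (drop rows without a borough, fall back to the borough when the neighbourhood is missing) can also be written with boolean masks. This is a minimal sketch on a small hand-made frame; the sample rows are hypothetical and are not the scraped data.

```python
import pandas as pd

# Hypothetical sample mimicking the scraped table (not the real data)
sample = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'North York', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Harbourfront', 'None'],
})

# Keep only rows whose borough is assigned; copy so later writes hit a real frame
cleaned = sample[sample['Borough'] != 'Not assigned'].copy()

# Where the neighbourhood is missing, fall back to the borough name
missing = cleaned['Neighbourhood'].isin(['None', 'Not assigned'])
cleaned.loc[missing, 'Neighbourhood'] = cleaned.loc[missing, 'Borough']

print(cleaned)
```

Writing through `.loc[mask, column]` on an explicit copy keeps pandas from raising the "copy of a slice" warning.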


In [4]:
# More than one neighborhood can exist in one postal code area, these rows will be combined into 
# one row with the neighborhoods separated by a comma
df_group=df.groupby(['Postcode','Borough'])
df_groups=df_group.groups
indices=[]
for groups in df_groups:
    indices.append(df_groups[groups].values)
    
# Once we have the row indices for each postcode, we build a new dataframe with one row per postcode
df_new = pd.DataFrame(columns=['Postcode','Borough','Neighbourhood'])
count = 0
for ind in indices:
    neigh = df.loc[ind]['Neighbourhood'].values
    st = ''
    for name in neigh:
        if name not in st:                   # skip names already present in the string
            st = st + name + ', '
    st = st[:-2]                             # delete the trailing comma and space
    df_new.loc[count] = df.loc[ind[0]]
    df_new.loc[count, 'Neighbourhood'] = st  # .loc[row, column] avoids chained assignment
    count = count + 1

# we print the final dataframe
df_new.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,Scarborough
1,M1C,Scarborough,Scarborough
2,M1E,Scarborough,"Guildwood , Scarborough"
3,M1G,Scarborough,Scarborough
4,M1H,Scarborough,Cedarbrae


In [5]:
# We print the dimensions of the final dataframe
df_new.shape

(103, 3)
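The loop above can be condensed with `groupby` and a small aggregation function. The sketch below uses hypothetical sample rows rather than the scraped table. It drops exact duplicate names only, whereas the loop uses a substring test, so the two can differ when one name is contained in another.

```python
import pandas as pd

# Hypothetical sample rows (not the real scraped data)
sample = pd.DataFrame({
    'Postcode': ['M1E', 'M1E', 'M1E', 'M1H'],
    'Borough': ['Scarborough'] * 4,
    'Neighbourhood': ['Guildwood', 'Morningside', 'Guildwood', 'Cedarbrae'],
})

def join_unique(names):
    """Join names with ', ' while dropping exact duplicates and keeping order."""
    return ', '.join(dict.fromkeys(names))

# One row per (Postcode, Borough) pair, neighbourhoods joined into one string
combined = (sample
            .groupby(['Postcode', 'Borough'])['Neighbourhood']
            .agg(join_unique)
            .reset_index())
print(combined)
```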

## Part 2

In order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood

In [6]:
# We get the latitude and longitude corresponding to each postal code in the previous dataframe
import geocoder # Import geocoder

In [7]:
# We add latitude and longitude column to the old dataframe
df_new.insert(3, "Latitude", None, True)
df_new.insert(4, "Longitude", None, True)

In [8]:
# Uncomment this part to use geocoder, however this library has some problems, for the moment 
# latitude and longitude data will be read from a csv external file

# Initialize variable to None
#lat_lng_coords = None


#postal_codes=df_new['Postcode']
#for i in range(0,len(postal_codes)):
    # loop until you get the coordinates
#    while(lat_lng_coords is None):
      #print(df_new.loc[i]['Postcode'])
#      g = geocoder.google('{}, Toronto, Ontario'.format(df_new.loc[i]['Postcode']))
#      lat_lng_coords = g.latlng
#    df_new.loc[i]['Latitude'] = lat_lng_coords[0]
#    df_new.loc[i]['Longitude'] = lat_lng_coords[1]       
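If the geocoding loop above is re-enabled, it may be worth bounding the number of retries, since a `while` loop that waits for a non-empty answer can run forever when the service never responds. The sketch below shows one way to do that. `geocode_with_retry` and `fake_lookup` are hypothetical helpers written for this illustration and are not part of the `geocoder` library.

```python
import time

def geocode_with_retry(lookup, query, max_attempts=3, delay=0.0):
    """Call lookup(query) until it returns coordinates or the attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay)          # wait before asking again
    return None                    # give up; the caller decides what to do

# Hypothetical stand-in for a geocoding call: fails once, then succeeds
calls = {'n': 0}

def fake_lookup(query):
    calls['n'] += 1
    return None if calls['n'] < 2 else [43.806686, -79.194353]

coords = geocode_with_retry(fake_lookup, 'M1B, Toronto, Ontario')
print(coords)
```

Because the result is returned per call, each postal code starts with a fresh lookup rather than reusing the coordinates found for the previous one.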
    


In [9]:
df_coord=pd.read_csv('Geospatial_Coordinates.csv')
# Get indexes for which the postcode of the two dataframes is equal
# (this row-by-row comparison assumes both dataframes have the same length and row order)
index_equal = df_new[df_new['Postcode'] == df_coord['Postal Code']].index
# Assign latitude and longitude to the df_new dataframe; .loc avoids chained assignment
df_new.loc[index_equal, 'Latitude'] = df_coord.loc[index_equal, 'Latitude']
df_new.loc[index_equal, 'Longitude'] = df_coord.loc[index_equal, 'Longitude']
df_new.head()


Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,Scarborough,43.806686,-79.194353
1,M1C,Scarborough,Scarborough,43.784535,-79.160497
2,M1E,Scarborough,"Guildwood , Scarborough",43.763573,-79.188711
3,M1G,Scarborough,Scarborough,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
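The row-by-row comparison used in this cell only works when both tables list the postal codes in the same order. A key-based merge does not depend on row order. The sketch below uses two hypothetical miniature tables, with column names taken from the cell above.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
neighbourhoods = pd.DataFrame({
    'Postcode': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
})
coords = pd.DataFrame({
    'Postal Code': ['M1C', 'M1B'],          # deliberately in a different order
    'Latitude': [43.784535, 43.806686],
    'Longitude': [-79.160497, -79.194353],
})

# Match rows on the postal code, then drop the duplicated key column
merged = (neighbourhoods
          .merge(coords, left_on='Postcode', right_on='Postal Code', how='left')
          .drop(columns='Postal Code'))
print(merged)
```

With `how='left'`, a postcode without coordinates would keep its row and get `NaN` in the coordinate columns instead of being dropped.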


In [10]:
# Display dataframe with latitude and longitude
df_new.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,Scarborough,43.806686,-79.194353
1,M1C,Scarborough,Scarborough,43.784535,-79.160497
2,M1E,Scarborough,"Guildwood , Scarborough",43.763573,-79.188711
3,M1G,Scarborough,Scarborough,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [11]:
address = 'Toronto, TO'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto City are 43.6534817, -79.3839347.


In [12]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_new['Latitude'], df_new['Longitude'], df_new['Borough'], df_new['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

## Part 3

In this part of the notebook we explore segmentation and clustering. First, however, we have to retrieve data from Foursquare.

In [13]:
# We count the number of neighborhoods in each borough
df_new['Borough'].value_counts()

North York          24
Downtown Toronto    19
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
York                 5
East Toronto         5
East York            5
Mississauga          1
Name: Borough, dtype: int64

In [14]:
# We analyze the borough of Downtown Toronto
downtown_data = df_new[df_new['Borough'] == 'Downtown Toronto'].reset_index(drop=True)
downtown_data.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M4W,Downtown Toronto,Downtown Toronto,43.679563,-79.377529
1,M4X,Downtown Toronto,Downtown Toronto,43.667967,-79.367675
2,M4Y,Downtown Toronto,Downtown Toronto,43.66586,-79.38316
3,M5A,Downtown Toronto,Downtown Toronto,43.65426,-79.360636
4,M5B,Downtown Toronto,"Ryerson , Garden District",43.657162,-79.378937


In [15]:
# We plot the points again on the Toronto map using folium
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(downtown_data['Latitude'], downtown_data['Longitude'], downtown_data['Borough'], downtown_data['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [16]:
# we define the credentials to retrieve data from Foursquare
CLIENT_ID = 'your_ID' # your Foursquare ID
CLIENT_SECRET = 'your_code' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

#print('Your credentials:')
#print('CLIENT_ID: ' + CLIENT_ID)
#print('CLIENT_SECRET:' + CLIENT_SECRET)

In [17]:
# we get the first neighborhood in our new dataframe
downtown_data.loc[0, 'Neighbourhood'] # in this case it has the same name as the borough

'Downtown Toronto'

Get the neighborhood's latitude and longitude values.

In [18]:
neighborhood_latitude = downtown_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = downtown_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = downtown_data.loc[0, 'Neighbourhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Downtown Toronto are 43.6795626, -79.37752940000001.


#### Now, let's get the top 100 venues that are in Downtown Toronto within a radius of 1100 meters.

First, let's create the GET request URL and store it in `url`.

In [19]:
LIMIT = 100 # limit of number of venues returned by Foursquare API

radius = 1100 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=your_ID&client_secret=your_code&v=20180605&ll=43.6795626,-79.37752940000001&radius=1100&limit=100'
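The query string can also be assembled from a dictionary instead of a long format string, which keeps each parameter next to its name. This is a minimal sketch with the standard library and placeholder credentials; the parameter names are the ones used in the URL above.

```python
from urllib.parse import urlencode

# Placeholder credentials; replace with your own values
params = {
    'client_id': 'your_ID',
    'client_secret': 'your_code',
    'v': '20180605',
    'll': '43.6795626,-79.3775294',
    'radius': 1100,
    'limit': 100,
}
base = 'https://api.foursquare.com/v2/venues/explore'

# urlencode escapes special characters, so the comma in 'll' becomes %2C
url = base + '?' + urlencode(params)
print(url)
```

`requests.get(base, params=params)` builds the query string in the same way, so the URL does not have to be formatted by hand at all.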

Send the GET request and examine the results

In [20]:
results = requests.get(url).json()

In [21]:
# we define a function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:  # after json_normalize the column is named 'venue.categories'
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the json and structure it into a *pandas* dataframe.

In [22]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON


# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Summerhill Market,Grocery Store,43.686265,-79.375458
1,Black Camel,BBQ Joint,43.677016,-79.389367
2,Greenhouse Juice Co,Juice Bar,43.679101,-79.390686
3,Toronto Lawn Tennis Club,Athletics & Sports,43.680667,-79.388559
4,Civello Salon & Spa,Salon / Barbershop,43.674413,-79.388378


In [23]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

60 venues were returned by Foursquare.
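To make the nesting of the response easier to follow, the same extraction can be written without `json_normalize`. The sketch below works on a hypothetical response fragment that only mimics the keys used in this notebook (`response`, `groups`, `items`, `venue`); it is not real Foursquare output.

```python
# Hypothetical response fragment with the same nesting the notebook relies on
sample_results = {
    'response': {'groups': [{'items': [
        {'venue': {'name': 'Venue A',
                   'categories': [{'name': 'Park'}],
                   'location': {'lat': 43.68, 'lng': -79.37}}},
        {'venue': {'name': 'Venue B',
                   'categories': [],          # a venue without any category
                   'location': {'lat': 43.67, 'lng': -79.38}}},
    ]}]}
}

def flatten_items(results):
    """Turn the nested items into flat dictionaries, one per venue."""
    rows = []
    for item in results['response']['groups'][0]['items']:
        venue = item['venue']
        categories = venue.get('categories', [])
        rows.append({
            'name': venue['name'],
            # use the first category name, or None when the list is empty
            'categories': categories[0]['name'] if categories else None,
            'lat': venue['location']['lat'],
            'lng': venue['location']['lng'],
        })
    return rows

rows = flatten_items(sample_results)
print(rows)
```

Passing `rows` to `pd.DataFrame` would give a table with the same four columns as the one shown above.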


We repeat the process on all the neighborhoods. To avoid confusion, we first rename the neighborhoods that share the same name.

In [24]:
count=0

for i in range(0,len(downtown_data)):
    if downtown_data.loc[i, 'Neighbourhood']=='Downtown Toronto':
        # .loc[row, column] writes to the dataframe itself and avoids the chained-assignment warning
        downtown_data.loc[i, 'Neighbourhood']=downtown_data.loc[i, 'Neighbourhood']+' '+str(count)
        count=count+1

downtown_data.head()



Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M4W,Downtown Toronto,Downtown Toronto 0,43.679563,-79.377529
1,M4X,Downtown Toronto,Downtown Toronto 1,43.667967,-79.367675
2,M4Y,Downtown Toronto,Downtown Toronto 2,43.66586,-79.38316
3,M5A,Downtown Toronto,Downtown Toronto 3,43.65426,-79.360636
4,M5B,Downtown Toronto,"Ryerson , Garden District",43.657162,-79.378937


In [25]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()['response']['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [26]:
# collect the venues for every neighborhood in Downtown Toronto (default radius of 500 m)

toronto_venues = getNearbyVenues(names=downtown_data['Neighbourhood'],
                                   latitudes=downtown_data['Latitude'],
                                   longitudes=downtown_data['Longitude']
                                  )

We check the size of the new dataframe

In [27]:
print(toronto_venues.shape)
toronto_venues.head()

(1289, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Downtown Toronto 0,43.679563,-79.377529,Rosedale Park,43.682328,-79.378934,Playground
1,Downtown Toronto 0,43.679563,-79.377529,Whitney Park,43.682036,-79.373788,Park
2,Downtown Toronto 0,43.679563,-79.377529,Alex Murray Parkette,43.6783,-79.382773,Park
3,Downtown Toronto 0,43.679563,-79.377529,Milkman's Lane,43.676352,-79.373842,Trail
4,Downtown Toronto 1,43.667967,-79.367675,Cranberries,43.667843,-79.369407,Diner


Let's check how many venues were returned for each neighborhood

In [28]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Adelaide , King , Richmond",100,100,100,100,100,100
Central Bay Street,78,78,78,78,78,78
Christie,17,17,17,17,17,17
Downtown Toronto 0,4,4,4,4,4,4
Downtown Toronto 1,44,44,44,44,44,44
Downtown Toronto 2,83,83,83,83,83,83
Downtown Toronto 3,44,44,44,44,44,44
Downtown Toronto 4,100,100,100,100,100,100
Downtown Toronto 5,55,55,55,55,55,55
Downtown Toronto 6,100,100,100,100,100,100


#### Let's find out how many unique categories can be curated from all the returned venues

In [29]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 205 unique categories.


In [30]:
# one hot encoding
toronto_venueshot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_venueshot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_venueshot.columns[-1]] + list(toronto_venueshot.columns[:-1])
toronto_venueshot = toronto_venueshot[fixed_columns]

toronto_venueshot.head()

Unnamed: 0,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Let's examine the new dataframe size.

In [31]:
toronto_venueshot.shape

(1289, 205)
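In the one-hot table above, the first column is `Yoga Studio` rather than `Neighborhood`, and the frame has 205 columns, exactly the number of unique categories. One plausible explanation (an assumption, not verified against the data) is that a venue category is itself called `Neighborhood`, so assigning the label column overwrote it instead of adding a new column. The sketch below shows a way to avoid such a clash on hypothetical data; the column name `Neighborhood Name` is chosen only for this illustration.

```python
import pandas as pd

# Hypothetical venues; one category is literally called 'Neighborhood'
venues = pd.DataFrame({
    'Neighborhood': ['Christie', 'Christie', 'Central Bay Street'],
    'Venue Category': ['Park', 'Neighborhood', 'Coffee Shop'],
})

# One indicator column per category
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)

# Insert the label first, under a name that cannot clash with a category
onehot.insert(0, 'Neighborhood Name', venues['Neighborhood'])

# Mean of the indicators = share of each category within a neighborhood
grouped = onehot.groupby('Neighborhood Name').mean().reset_index()
print(grouped)
```

`DataFrame.insert` raises an error if the column name already exists, so a clash would be reported instead of silently replacing data.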

#### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [32]:
toronto_grouped = toronto_venueshot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,"Adelaide , King , Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,...,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.01
1,Central Bay Street,0.012821,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012821,...,0.0,0.0,0.0,0.012821,0.0,0.0,0.012821,0.0,0.0,0.0
2,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Downtown Toronto 0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Downtown Toronto 1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Downtown Toronto 2,0.024096,0.012048,0.0,0.0,0.0,0.0,0.0,0.0,0.012048,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012048,0.012048,0.0
6,Downtown Toronto 3,0.022727,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Downtown Toronto 4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,...,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0
8,Downtown Toronto 5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.018182,0.0,0.0,0.0,0.0,0.0,0.0
9,Downtown Toronto 6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,...,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.0,0.0,0.0


#### Let's print each neighborhood along with the top 5 most common venues

In [33]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide , King , Richmond ----
             venue  freq
0      Coffee Shop  0.07
1       Restaurant  0.06
2             Café  0.05
3           Bakery  0.03
4  Thai Restaurant  0.03


----Central Bay Street ----
                 venue  freq
0          Coffee Shop  0.18
1   Italian Restaurant  0.05
2      Thai Restaurant  0.04
3         Burger Joint  0.04
4  Japanese Restaurant  0.04


----Christie ----
           venue  freq
0  Grocery Store  0.24
1           Café  0.18
2           Park  0.12
3      Nightclub  0.06
4          Diner  0.06


----Downtown Toronto 0----
           venue  freq
0           Park  0.50
1     Playground  0.25
2          Trail  0.25
3    Yoga Studio  0.00
4  Movie Theater  0.00


----Downtown Toronto 1----
                venue  freq
0          Restaurant  0.07
1         Coffee Shop  0.07
2  Chinese Restaurant  0.05
3              Bakery  0.05
4                Café  0.05


----Downtown Toronto 2----
                 venue  freq
0          Coffee Shop  0.06
1

First, let's write a function to sort the venues in descending order.

In [34]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [35]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # positions beyond the 3rd use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide , King , Richmond",Coffee Shop,Restaurant,Café,Bakery,Bar,Thai Restaurant,Gym,Sushi Restaurant,Lounge,Salad Place
1,Central Bay Street,Coffee Shop,Italian Restaurant,Sandwich Place,Japanese Restaurant,Burger Joint,Thai Restaurant,Ice Cream Shop,Café,Salad Place,Department Store
2,Christie,Grocery Store,Café,Park,Diner,Baby Store,Restaurant,Candy Store,Gas Station,Nightclub,Coffee Shop
3,Downtown Toronto 0,Park,Playground,Trail,Department Store,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run
4,Downtown Toronto 1,Coffee Shop,Restaurant,Café,Italian Restaurant,Chinese Restaurant,Bakery,Pizza Place,Pub,Butcher,Indian Restaurant


## Cluster Neighborhoods

In [36]:
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,"Adelaide , King , Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,...,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.01
1,Central Bay Street,0.012821,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012821,...,0.0,0.0,0.0,0.012821,0.0,0.0,0.012821,0.0,0.0,0.0
2,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Downtown Toronto 0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Downtown Toronto 1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Downtown Toronto 2,0.024096,0.012048,0.0,0.0,0.0,0.0,0.0,0.0,0.012048,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012048,0.012048,0.0
6,Downtown Toronto 3,0.022727,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Downtown Toronto 4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,...,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0
8,Downtown Toronto 5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.018182,0.0,0.0,0.0,0.0,0.0,0.0
9,Downtown Toronto 6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,...,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.0,0.0,0.0


Run *k*-means to cluster the neighborhoods into 4 clusters.

In [37]:
# set number of clusters
kclusters = 4

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 2, 0, 1, 2, 2, 2, 2, 2, 2])
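To show what the clustering step does conceptually, here is a plain NumPy sketch of Lloyd iterations, the basic procedure behind *k*-means. It is a simplified illustration on hypothetical two-column frequency vectors and is not scikit-learn's implementation, which uses a more careful initialization and stopping rule.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=10, seed=0):
    """Plain Lloyd iterations: assign to nearest centroid, then recompute means."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows of X (fancy indexing returns a copy)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):            # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Hypothetical frequency vectors: two 'park-heavy' rows and two 'coffee-heavy' rows
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
labels, centroids = lloyd_kmeans(X, k=2)
print(labels)
```

The label numbers themselves are arbitrary; only the grouping matters, which is why cluster `2` above carries no meaning beyond "same group as the other rows labeled 2".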

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [38]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = downtown_data

# join downtown_data with neighborhoods_venues_sorted to add the cluster label and top venues to each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4W,Downtown Toronto,Downtown Toronto 0,43.679563,-79.377529,1,Park,Playground,Trail,Department Store,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run
1,M4X,Downtown Toronto,Downtown Toronto 1,43.667967,-79.367675,2,Coffee Shop,Restaurant,Café,Italian Restaurant,Chinese Restaurant,Bakery,Pizza Place,Pub,Butcher,Indian Restaurant
2,M4Y,Downtown Toronto,Downtown Toronto 2,43.66586,-79.38316,2,Coffee Shop,Japanese Restaurant,Gay Bar,Restaurant,Sushi Restaurant,Yoga Studio,Hotel,Café,Mediterranean Restaurant,Men's Store
3,M5A,Downtown Toronto,Downtown Toronto 3,43.65426,-79.360636,2,Coffee Shop,Park,Pub,Theater,Breakfast Spot,Restaurant,Bakery,Café,Mexican Restaurant,Distribution Center
4,M5B,Downtown Toronto,"Ryerson , Garden District",43.657162,-79.378937,2,Coffee Shop,Clothing Store,Middle Eastern Restaurant,Japanese Restaurant,Cosmetics Shop,Bubble Tea Shop,Café,Diner,Pizza Place,Bookstore


### Clusters visualization

In [39]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

#### Clusters analysis

In [40]:
# Cluster 0
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
17,Downtown Toronto,0,Grocery Store,Café,Park,Diner,Baby Store,Restaurant,Candy Store,Gas Station,Nightclub,Coffee Shop


In [41]:
# Cluster 1
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown Toronto,1,Park,Playground,Trail,Department Store,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run


In [42]:
# Cluster 2
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Downtown Toronto,2,Coffee Shop,Restaurant,Café,Italian Restaurant,Chinese Restaurant,Bakery,Pizza Place,Pub,Butcher,Indian Restaurant
2,Downtown Toronto,2,Coffee Shop,Japanese Restaurant,Gay Bar,Restaurant,Sushi Restaurant,Yoga Studio,Hotel,Café,Mediterranean Restaurant,Men's Store
3,Downtown Toronto,2,Coffee Shop,Park,Pub,Theater,Breakfast Spot,Restaurant,Bakery,Café,Mexican Restaurant,Distribution Center
4,Downtown Toronto,2,Coffee Shop,Clothing Store,Middle Eastern Restaurant,Japanese Restaurant,Cosmetics Shop,Bubble Tea Shop,Café,Diner,Pizza Place,Bookstore
5,Downtown Toronto,2,Coffee Shop,Café,Restaurant,Hotel,Italian Restaurant,Diner,Breakfast Spot,Cosmetics Shop,Bakery,Beer Bar
6,Downtown Toronto,2,Coffee Shop,Cocktail Bar,Café,Beer Bar,Farmers Market,Restaurant,Seafood Restaurant,Cheese Shop,Bakery,Basketball Stadium
7,Downtown Toronto,2,Coffee Shop,Italian Restaurant,Sandwich Place,Japanese Restaurant,Burger Joint,Thai Restaurant,Ice Cream Shop,Café,Salad Place,Department Store
8,Downtown Toronto,2,Coffee Shop,Restaurant,Café,Bakery,Bar,Thai Restaurant,Gym,Sushi Restaurant,Lounge,Salad Place
9,Downtown Toronto,2,Coffee Shop,Aquarium,Hotel,Café,Italian Restaurant,Scenic Lookout,Brewery,Sporting Goods Shop,Restaurant,Fried Chicken Joint
10,Downtown Toronto,2,Coffee Shop,Café,Hotel,Restaurant,Japanese Restaurant,Bar,Gastropub,Seafood Restaurant,Italian Restaurant,American Restaurant


In [43]:
# Cluster 3
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,Downtown Toronto,3,Airport Service,Airport Lounge,Airport Terminal,Boat or Ferry,Harbor / Marina,Sculpture Garden,Rental Car Location,Plane,Coffee Shop,Boutique
