<h2> 1-Libraries </h2>

In [1]:
# Data Analysis Libraries
import pandas as pd
import numpy as np 

#API libraries
import requests
from bs4 import BeautifulSoup
import json

# Library for flattening JSON files
# (json_normalize is exposed at the top level in pandas >= 1.0)
from pandas import json_normalize

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

#K-means clustering from Sklearn
from sklearn.cluster import KMeans

#Library to construct and visualize maps
import folium

#Time libraries that will be used throughout the code to measure
#the running time of each block of code
import datetime 
import time

<h2>2-Importing Data </h2> 

<h5> We first need a table that lists the postal codes, boroughs, and neighborhoods of Toronto. Such a table is available <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">here</a>.
We will use the <a href="http://beautiful-soup-4.readthedocs.io/en/latest/">BeautifulSoup</a> library to scrape the page with the <em>lxml</em> parser. </h5>



In [2]:
start_time = time.time()
################################


URL='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
web=requests.get(URL).text
soup= BeautifulSoup(web, 'lxml')


################################
print("Time --- Minutes ---", np.round((time.time() - start_time) / 60, decimals=2))

Time --- Minutes --- 0.04


<h5> We will now search the parsed document for the table, and then separate its cells into three lists, one per column of the table on the Wikipedia page. </h5> 

In [3]:
start_time = time.time()
################################

#Search for the table in the webpage
my_table=soup.find('table')

#Convert the table into a list
entries = list(my_table.find_all('td'))

print("The first five elements of the list are: {}".format(entries[:5]))

################################
print("Time --- Minutes ---", np.round((time.time() - start_time) / 60, decimals=2))

The first five elements of the list are: [<td>M1A
</td>, <td>Not assigned
</td>, <td>Not assigned
</td>, <td>M2A
</td>, <td>Not assigned
</td>]
Time --- Minutes --- 0.0


<h5> Notice how the items in the list are surrounded by tags, indicating that we need further cleaning. </h5> 

In [4]:
start_time = time.time()
################################

#Convert the elements in the list into strings
entries=[str(i) for i in entries]

#Remove the leading '<td>' (4 characters) and the trailing newline plus '</td>' (6 characters)
proper_entries=[k[4:len(k)-6] for k in entries]

#Separating the entries into 3 lists
PostalCode=[] 
Borough=[]
Neighborhood=[]
i=0
while i<len(proper_entries):
    PostalCode.append(proper_entries[i])
    Borough.append(proper_entries[i+1])
    Neighborhood.append(proper_entries[i+2])
    
    i=i+3
    

toronto_dataset=pd.DataFrame({'PostalCode':PostalCode, 'Borough':Borough, 'Neighborhood': Neighborhood})

################################
print("Time --- Minutes ---", np.round((time.time() - start_time) / 60, decimals=2))

Time --- Minutes --- 0.0
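
<h5> The index-stepping loop above can also be written with extended slices. The following is a minimal sketch on a made-up list of already-cleaned cells (the sample values are illustrative, not the scraped data). It assumes, as the loop does, that the cells arrive in complete groups of three. </h5>

```python
# Illustrative sample: cleaned table cells in row-major order
# (postal code, borough, neighborhood).
cells = [
    "M1A", "Not assigned", "Not assigned",
    "M3A", "North York", "Parkwoods",
    "M4A", "North York", "Victoria Village",
]

# The grouping only makes sense for complete rows, so check that first.
if len(cells) % 3 != 0:
    raise ValueError("expected a multiple of three cells")

# Take every third element, starting at offsets 0, 1 and 2.
postal_codes = cells[0::3]
boroughs = cells[1::3]
neighborhoods = cells[2::3]

# Re-assemble the rows to confirm the columns line up.
rows = list(zip(postal_codes, boroughs, neighborhoods))
print(rows[1])
```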


In [5]:
toronto_dataset.head()

Unnamed: 0,Borough,Neighborhood,PostalCode
0,Not assigned,Not assigned,M1A
1,Not assigned,Not assigned,M2A
2,North York,Parkwoods,M3A
3,North York,Victoria Village,M4A
4,Downtown Toronto,"Regent Park, Harbourfront",M5A


<h5> We are only interested in the neighborhoods which are registered in some borough. </h5> 

In [6]:
start_time = time.time()
################################

#Rearranging columns 
toronto_dataset=toronto_dataset[['PostalCode', 'Borough', 'Neighborhood']]

#Removing entries in which Borough is Not assigned
toronto_dataset=toronto_dataset[toronto_dataset['Borough']!='Not assigned']
toronto_dataset.reset_index(drop=True, inplace=True)


################################
print("Time --- Minutes ---", np.round((time.time() - start_time) / 60, decimals=2))

Time --- Minutes --- 0.0


In [7]:
toronto_dataset.head()


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [8]:
toronto_dataset.shape

(103, 3)

<h5> We will download a dataset with the latitude/longitude coordinates of each postal code in Toronto, and then merge it with the original dataset on the postal code. </h5> 

In [9]:
start_time = time.time()
################################


#Downloading coordinates data
coord=pd.read_csv('http://cocl.us/Geospatial_data')
print(coord.head())




################################
print("Time --- Minutes ---", np.round((time.time() - start_time) / 60, decimals=2))

  Postal Code   Latitude  Longitude
0         M1B  43.806686 -79.194353
1         M1C  43.784535 -79.160497
2         M1E  43.763573 -79.188711
3         M1G  43.770992 -79.216917
4         M1H  43.773136 -79.239476
Time --- Minutes --- 0.11


In [10]:
start_time = time.time()
################################


#changing the name of the Postal Code column in coord to PostalCode
coord.columns=['PostalCode', 'Latitude', 'Longitude']


#merging coordinates data with toronto dataset
tor_df=pd.merge(toronto_dataset,coord,on='PostalCode')


################################
print("Time --- Minutes ---", np.round((time.time() - start_time) / 60, decimals=2))

Time --- Minutes --- 0.0


In [11]:
tor_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
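
<h5> <code>pd.merge</code> performs an inner join by default, so a postal code that is missing from the coordinates file would be dropped without any message. One way to check for that is sketched below on made-up frames; the values and the names <code>left</code>, <code>right</code> and <code>check</code> are illustrative only. </h5>

```python
import pandas as pd

# Illustrative frames; M9Z has no coordinates on purpose.
left = pd.DataFrame({"PostalCode": ["M3A", "M4A", "M9Z"],
                     "Borough": ["North York", "North York", "Etobicoke"]})
right = pd.DataFrame({"PostalCode": ["M3A", "M4A"],
                      "Latitude": [43.753259, 43.725882],
                      "Longitude": [-79.329656, -79.315572]})

# A left join keeps every row of `left`; indicator=True adds a `_merge`
# column that records whether a match was found in `right`.
check = pd.merge(left, right, on="PostalCode", how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "PostalCode"].tolist()
print("Postal codes without coordinates:", unmatched)
```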


<h2>3-Exploration </h2>

<h5> We will now visualize all the neighborhoods in all boroughs of Toronto using a map. </h5>

In [12]:
start_time = time.time()
################################

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[43.6532, -79.3832], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(tor_df['Latitude'], tor_df['Longitude'], tor_df['Borough'], tor_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
################################
print("Time --- Minutes ---", np.round((time.time() - start_time) / 60, decimals=2))    
map_toronto

Time --- Minutes --- 0.0


<h5>Let's simplify the analysis by segmenting and clustering only the neighborhoods in Downtown Toronto. To do so, we slice the original dataframe and create a new dataframe with the Downtown Toronto data. </h5>

In [13]:
start_time = time.time()
################################

#Chaining reset_index returns a new dataframe, which avoids pandas' SettingWithCopyWarning
dt_toronto=tor_df[tor_df['Borough']=='Downtown Toronto'].reset_index(drop=True)



################################
print("Time --- Minutes ---", np.round((time.time() - start_time) / 60, decimals=2))

Time --- Minutes --- 0.0


In [14]:
print(dt_toronto.shape)
dt_toronto.head()

(19, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306


<h5>Just as we did before, let's visualize only the neighborhoods in Downtown Toronto on a map. This time, the latitude and longitude of the map center were taken directly from Google.</h5>

In [15]:
start_time = time.time()
################################

map_dttoronto = folium.Map(location=[43.6548, -79.3883], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(dt_toronto['Latitude'], dt_toronto['Longitude'], dt_toronto['Borough'], dt_toronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_dttoronto)  
    
################################
print("Time --- Minutes ---", np.round((time.time() - start_time) / 60, decimals=2))   
map_dttoronto

Time --- Minutes --- 0.0


<h5> Next, we will use the Foursquare API to explore the neighborhoods and segment them. In what follows, the author's client ID and client secret are stored in the variables 
<code> CLIENT_ID </code> and <code> CLIENT_SECRET </code> respectively. To create your own credentials, please refer to the <a href="https://developer.foursquare.com/docs/places-api/"> Foursquare API documentation </a>. </h5> 

In [16]:
#Foursquare API version
VERSION = '20200625'
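
<h5> The cells that follow expect <code>CLIENT_ID</code> and <code>CLIENT_SECRET</code> to be defined, but the notebook does not show where they are set. One common approach, sketched here as an assumption rather than the author's method, is to read them from environment variables. The variable names <code>FOURSQUARE_CLIENT_ID</code> and <code>FOURSQUARE_CLIENT_SECRET</code> and the helper <code>load_credential</code> are hypothetical. </h5>

```python
import os

def load_credential(name, environ=None):
    """Return the named variable from `environ`, or raise a clear error."""
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if not value:
        raise RuntimeError("Set the {} environment variable first".format(name))
    return value

# Hypothetical variable names; a placeholder mapping stands in for os.environ
# so that the sketch runs without real credentials.
demo_env = {"FOURSQUARE_CLIENT_ID": "your-id",
            "FOURSQUARE_CLIENT_SECRET": "your-secret"}
CLIENT_ID = load_credential("FOURSQUARE_CLIENT_ID", demo_env)
CLIENT_SECRET = load_credential("FOURSQUARE_CLIENT_SECRET", demo_env)
```

<h5> In a real session, the second argument would be omitted so that the values come from <code>os.environ</code>. </h5>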

<h5>We will now explore the first Downtown Toronto neighborhood in our dataframe by first recording its latitude and longitude, and then using the Foursquare API to retrieve up to 100 venues within a radius of 500 meters of that point. </h5>

In [17]:
start_time = time.time()
################################

#Neighborhood info
neigh_name=dt_toronto.loc[0,'Neighborhood']
neigh_lat=dt_toronto.loc[0,'Latitude']
neigh_long=dt_toronto.loc[0,'Longitude']

print("The neighborhood is {} and its latitude and longitude are [{}, {}]".format(neigh_name, neigh_lat, neigh_long))


#API call
url = 'https://api.foursquare.com/v2/venues/explore'
#Parameters
params = dict(
    client_id=CLIENT_ID,
    client_secret=CLIENT_SECRET,
    v=VERSION,
    ll=str(neigh_lat)+','+str(neigh_long),
    radius=500,
    limit=100
)

#getting the response
resp = requests.get(url=url, params=params)
data = json.loads(resp.text)

################################
print("Time --- Minutes ---", np.round((time.time() - start_time) / 60, decimals=2))  

The neighborhood is Regent Park, Harbourfront and its latitude and longitude are [43.6542599, -79.3606359]
Time --- Minutes --- 0.01


In [18]:
data['response']['groups'][0]['items'][:2]

[{'reasons': {'count': 0,
   'items': [{'reasonName': 'globalInteractionReason',
     'summary': 'This spot is popular',
     'type': 'general'}]},
  'referralId': 'e-0-54ea41ad498e9a11e9e13308-0',
  'venue': {'categories': [{'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/bakery_',
      'suffix': '.png'},
     'id': '4bf58dd8d48988d16a941735',
     'name': 'Bakery',
     'pluralName': 'Bakeries',
     'primary': True,
     'shortName': 'Bakery'}],
   'id': '54ea41ad498e9a11e9e13308',
   'location': {'address': '362 King St E',
    'cc': 'CA',
    'city': 'Toronto',
    'country': 'Canada',
    'crossStreet': 'Trinity St',
    'distance': 143,
    'formattedAddress': ['362 King St E (Trinity St)',
     'Toronto ON M5A 1K9',
     'Canada'],
    'labeledLatLngs': [{'label': 'display',
      'lat': 43.653446723052674,
      'lng': -79.3620167174383}],
    'lat': 43.653446723052674,
    'lng': -79.3620167174383,
    'postalCode': 'M5A 1K9',
    'state': 'ON'},
   'name': '

<h5> The <code> data </code> variable stores a lot of information in JSON format. Most of the information about each returned venue is stored under the "items" key. We will first define a function that extracts the category of a venue. Then we will store the name, category, latitude and longitude of each venue in a dataframe. </h5>



In [19]:
start_time = time.time()
################################

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # flattened rows store the categories under the dotted column name
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']


    
venues = data['response']['groups'][0]['items']

#flatten JSON
nearby_venues = json_normalize(venues) 

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns] 

################################
print("Time --- Minutes ---", np.round((time.time() - start_time) / 60, decimals=2))  

Time --- Minutes --- 0.0
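
<h5> The dotted column names such as <code>venue.location.lat</code> come from <code>json_normalize</code>, which joins nested keys with a period. The pure-Python sketch below mimics that flattening for a single record to show where the names come from. <code>flatten</code> is a hypothetical helper, not part of pandas, and the sample record is abbreviated from the response printed earlier. </h5>

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one dict with dotted keys; lists stay as values."""
    flat = {}
    for key, value in record.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the prefix along.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

# Abbreviated sample item with the same nesting as the API response.
item = {
    "referralId": "e-0-54ea41ad498e9a11e9e13308-0",
    "venue": {
        "name": "Roselle Desserts",
        "categories": [{"name": "Bakery", "primary": True}],
        "location": {"lat": 43.653446723052674, "lng": -79.3620167174383},
    },
}

for key in sorted(flatten(item)):
    print(key)
```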


In [20]:
nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Roselle Desserts,Bakery,43.653447,-79.362017
1,Tandem Coffee,Coffee Shop,43.653559,-79.361809
2,Cooper Koo Family YMCA,Distribution Center,43.653249,-79.358008
3,Body Blitz Spa East,Spa,43.654735,-79.359874
4,Dominion Pub and Kitchen,Pub,43.656919,-79.358967


In [21]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

44 venues were returned by Foursquare.


<h5>Let's create a function that repeats the process above for all neighborhoods in Downtown Toronto. </h5>

In [23]:
start_time = time.time()
################################

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            100)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

downtowntor_venues = getNearbyVenues(names=dt_toronto['Neighborhood'],
                                   latitudes=dt_toronto['Latitude'],
                                   longitudes=dt_toronto['Longitude']
                                  )

################################
print("Time --- Minutes ---", np.round((time.time() - start_time) / 60, decimals=2))  

Regent Park, Harbourfront
Queen's Park, Ontario Provincial Government
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Harbourfront East, Union Station, Toronto Islands
Toronto Dominion Centre, Design Exchange
Commerce Court, Victoria Hotel
University of Toronto, Harbord
Kensington Market, Chinatown, Grange Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Rosedale
Stn A PO Boxes
St. James Town, Cabbagetown
First Canadian Place, Underground city
Church and Wellesley
Time --- Minutes --- 0.29
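
<h5> The list comprehension in <code>getNearbyVenues</code> indexes <code>categories[0]</code> directly, which would raise an <code>IndexError</code> for a venue that has no category. A hedged sketch of a parsing helper that tolerates that case follows; <code>parse_items</code> is a hypothetical name, and the two sample items are made up to exercise both branches. </h5>

```python
def parse_items(name, lat, lng, items):
    """Turn Foursquare-style `items` into flat tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None when the venue has no category at all.
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng,
                     venue["name"],
                     venue["location"]["lat"],
                     venue["location"]["lng"],
                     category))
    return rows

# Made-up items: one with a category, one without.
sample_items = [
    {"venue": {"name": "Venue A", "categories": [{"name": "Bakery"}],
               "location": {"lat": 43.6534, "lng": -79.3620}}},
    {"venue": {"name": "Venue B", "categories": [],
               "location": {"lat": 43.6540, "lng": -79.3610}}},
]

parsed = parse_items("Regent Park, Harbourfront", 43.65426, -79.360636, sample_items)
for row in parsed:
    print(row)
```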


In [24]:
print(downtowntor_venues.shape)
downtowntor_venues.head()

(1218, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,"Regent Park, Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,"Regent Park, Harbourfront",43.65426,-79.360636,Dominion Pub and Kitchen,43.656919,-79.358967,Pub


<h5> Let's check how many venues were returned for each neighborhood </h5>

In [25]:
downtowntor_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Berczy Park,56,56,56,56,56,56
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",16,16,16,16,16,16
Central Bay Street,64,64,64,64,64,64
Christie,17,17,17,17,17,17
Church and Wellesley,75,75,75,75,75,75
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
"First Canadian Place, Underground city",100,100,100,100,100,100
"Garden District, Ryerson",100,100,100,100,100,100
"Harbourfront East, Union Station, Toronto Islands",100,100,100,100,100,100
"Kensington Market, Chinatown, Grange Park",62,62,62,62,62,62


<h5> Let's find out how many unique categories there are among all the returned venues. </h5>

In [26]:
print('There are {} unique categories.'.format(len(downtowntor_venues['Venue Category'].unique())))

There are 213 unique categories.


In [27]:
start_time = time.time()
################################

# one hot encoding
dttor_onehot = pd.get_dummies(downtowntor_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
dttor_onehot['Neighborhood'] = downtowntor_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = ['Neighborhood'] + [col for col in dttor_onehot.columns if col != 'Neighborhood']
dttor_onehot = dttor_onehot[fixed_columns]


################################
print("Time --- Minutes ---", np.round((time.time() - start_time) / 60, decimals=2))  

Time --- Minutes --- 0.0


In [28]:
print(dttor_onehot.shape)
dttor_onehot.head()

(1218, 213)


Unnamed: 0,Neighborhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theme Restaurant,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


<h5> Next, let's group the rows by neighborhood and take the mean of each one-hot column, which gives the frequency of occurrence of each category in each neighborhood. </h5>

In [29]:
dttor_grouped = dttor_onehot.groupby('Neighborhood').mean().reset_index()
print(dttor_grouped.shape)
dttor_grouped

(19, 213)


Unnamed: 0,Neighborhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theme Restaurant,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.017857,0.0,0.0,0.0,0.0,0.0,0.0
1,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0625,0.0625,0.0625,0.125,0.1875,0.125,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.015625,0.0,0.0,0.015625,0.0,0.0,0.015625
3,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Church and Wellesley,0.013333,0.0,0.0,0.0,0.0,0.0,0.0,0.013333,0.0,...,0.013333,0.0,0.0,0.0,0.0,0.0,0.0,0.013333,0.0,0.026667
5,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,...,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.0
6,"First Canadian Place, Underground city",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,...,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.0,0.0,0.0
7,"Garden District, Ryerson",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.01,0.01,0.01,0.0,0.0,0.0
8,"Harbourfront East, Union Station, Toronto Islands",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.0,0.0,0.0
9,"Kensington Market, Chinatown, Grange Park",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.064516,0.0,0.048387,0.016129,0.0,0.0,0.0
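
<h5> Each venue contributes a single 1 to its category column, so the per-neighborhood mean is the share of that neighborhood's venues in each category. As a cross-check, the same table can be sketched with <code>pd.crosstab(..., normalize='index')</code>. The small frame below is made up for illustration. </h5>

```python
import pandas as pd

# Made-up venues: three in neighborhood A, one in neighborhood B.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue Category": ["Bakery", "Bakery", "Park", "Park"],
})

# Route 1: one-hot encode, then average per neighborhood.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
by_mean = onehot.groupby("Neighborhood").mean()

# Route 2: cross-tabulate and normalize each row so that it sums to 1.
by_crosstab = pd.crosstab(venues["Neighborhood"], venues["Venue Category"],
                          normalize="index")

print(by_mean)
print(by_crosstab)
```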


<h5> Let's print each neighborhood along with the top 5 most common venues </h5>

In [30]:
start_time = time.time()
################################

num_top_venues = 5

for hood in dttor_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = dttor_grouped[dttor_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')
    

################################
print("Time --- Minutes ---", np.round((time.time() - start_time) / 60, decimals=2)) 

----Berczy Park----
                venue  freq
0         Coffee Shop  0.09
1        Cocktail Bar  0.05
2          Restaurant  0.04
3         Cheese Shop  0.04
4  Seafood Restaurant  0.04


----CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport----
              venue  freq
0   Airport Service  0.19
1    Airport Lounge  0.12
2  Airport Terminal  0.12
3     Boat or Ferry  0.06
4          Boutique  0.06


----Central Bay Street----
                 venue  freq
0          Coffee Shop  0.17
1       Sandwich Place  0.06
2   Italian Restaurant  0.06
3                 Café  0.05
4  Japanese Restaurant  0.05


----Christie----
                venue  freq
0       Grocery Store  0.24
1                Café  0.18
2                Park  0.12
3  Italian Restaurant  0.06
4           Nightclub  0.06


----Church and Wellesley----
                 venue  freq
0          Coffee Shop  0.07
1     Sushi Restaurant  0.07
2  Japanese Restaurant  0.05
3 

<h5>Let's put that into a <em>pandas</em> dataframe. We will first write a function to sort the venues in descending order of frequency. Then, we will create the new dataframe and display the top 10 venues for each neighborhood. </h5>

In [31]:
start_time = time.time()
################################

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond the third use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = dttor_grouped['Neighborhood']

for ind in np.arange(dttor_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(dttor_grouped.iloc[ind, :], num_top_venues)



################################
print("Time --- Minutes ---", np.round((time.time() - start_time) / 60, decimals=2)) 

Time --- Minutes --- 0.0


In [32]:
neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Cocktail Bar,Restaurant,Cheese Shop,Seafood Restaurant,Beer Bar,Bakery,Café,Irish Pub,Diner
1,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport Lounge,Airport Terminal,Plane,Harbor / Marina,Boat or Ferry,Rental Car Location,Boutique,Sculpture Garden,Airport
2,Central Bay Street,Coffee Shop,Sandwich Place,Italian Restaurant,Café,Japanese Restaurant,Salad Place,Thai Restaurant,Burger Joint,Bubble Tea Shop,Department Store
3,Christie,Grocery Store,Café,Park,Coffee Shop,Candy Store,Italian Restaurant,Diner,Restaurant,Athletics & Sports,Baby Store
4,Church and Wellesley,Coffee Shop,Sushi Restaurant,Japanese Restaurant,Restaurant,Gay Bar,Yoga Studio,Men's Store,Mediterranean Restaurant,Hotel,Pub


<h3> Clustering </h3>

<h5> We will now run <em>k</em>-means to group the neighborhoods into 5 clusters. Then we will create a new dataframe that includes the cluster label as well as the top 10 venues for each neighborhood.</h5> 

In [33]:
start_time = time.time()
################################


# set number of clusters
kclusters = 5

dttor_grouped_clustering = dttor_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(dttor_grouped_clustering)

# check cluster labels generated for each row in the dataframe
print("The labels are: {} ".format(kmeans.labels_[0:10]))

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

dttor_merged = dt_toronto

# join the cluster labels and top venues onto dt_toronto, which holds the latitude/longitude of each neighborhood
dttor_merged = dttor_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

################################
print("Time --- Minutes ---", np.round((time.time() - start_time) / 60, decimals=2)) 


The labels are: [2 0 4 3 2 2 2 2 2 2] 
Time --- Minutes --- 0.0
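
<h5> The number of clusters is fixed by hand here. One common heuristic for sanity-checking such a choice is to compare the <em>k</em>-means inertia (the within-cluster sum of squared distances) across several values of <em>k</em> and look for an elbow. The sketch below runs on synthetic points, not on the venue data, and sets <code>n_init=10</code> explicitly as an assumption so that the result does not depend on the default of the installed scikit-learn version. </h5>

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs of 20 points each.
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.randn(20, 2) for c in centers])

# Fit k-means for several k and record the inertia of each fit.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances

for k, value in inertias.items():
    print(k, round(value, 2))
```

<h5> With only 19 neighborhoods, a check like this can be no more than a rough guide. </h5>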


In [34]:
dttor_merged.head() 

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,2,Coffee Shop,Bakery,Pub,Park,Theater,Breakfast Spot,Café,Restaurant,Spa,Distribution Center
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,4,Coffee Shop,Diner,Sushi Restaurant,Discount Store,Smoothie Shop,Beer Bar,Italian Restaurant,Sculpture Garden,Sandwich Place,Distribution Center
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,2,Clothing Store,Coffee Shop,Japanese Restaurant,Italian Restaurant,Middle Eastern Restaurant,Café,Bubble Tea Shop,Cosmetics Shop,Diner,Lingerie Store
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,2,Café,Coffee Shop,Restaurant,American Restaurant,Gastropub,Cocktail Bar,Clothing Store,Moroccan Restaurant,Cosmetics Shop,Lingerie Store
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,2,Coffee Shop,Cocktail Bar,Restaurant,Cheese Shop,Seafood Restaurant,Beer Bar,Bakery,Café,Irish Pub,Diner


<h5> The following map shows each cluster in a different color. In other words, neighborhoods drawn in the same color belong to the same cluster. </h5>

In [35]:
start_time = time.time()
################################

# create map
map_clusters = folium.Map(location=[43.6548, -79.3883], zoom_start=13)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(dttor_merged['Latitude'], dttor_merged['Longitude'], dttor_merged['Neighborhood'], dttor_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)

################################
print("Time --- Minutes ---", np.round((time.time() - start_time) / 60, decimals=2)) 


Time --- Minutes --- 0.0


In [36]:
map_clusters

<h5> Finally, we'll take a look at each cluster separately </h5>

In [37]:
#Cluster 0
dttor_merged[dttor_merged['Cluster Labels']==0]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
13,M5V,Downtown Toronto,"CN Tower, King and Spadina, Railway Lands, Har...",43.628947,-79.39442,0,Airport Service,Airport Lounge,Airport Terminal,Plane,Harbor / Marina,Boat or Ferry,Rental Car Location,Boutique,Sculpture Garden,Airport


In [38]:
#Cluster 1
dttor_merged[dttor_merged['Cluster Labels']==1]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529,1,Park,Playground,Trail,Cosmetics Shop,Dog Run,Distribution Center,Discount Store,Diner,Dim Sum Restaurant,Dessert Shop


In [39]:
#Cluster 2
dttor_merged[dttor_merged['Cluster Labels']==2]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,2,Coffee Shop,Bakery,Pub,Park,Theater,Breakfast Spot,Café,Restaurant,Spa,Distribution Center
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,2,Clothing Store,Coffee Shop,Japanese Restaurant,Italian Restaurant,Middle Eastern Restaurant,Café,Bubble Tea Shop,Cosmetics Shop,Diner,Lingerie Store
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418,2,Café,Coffee Shop,Restaurant,American Restaurant,Gastropub,Cocktail Bar,Clothing Store,Moroccan Restaurant,Cosmetics Shop,Lingerie Store
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306,2,Coffee Shop,Cocktail Bar,Restaurant,Cheese Shop,Seafood Restaurant,Beer Bar,Bakery,Café,Irish Pub,Diner
7,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568,2,Coffee Shop,Café,Restaurant,Gym,Hotel,Thai Restaurant,Deli / Bodega,Concert Hall,Bookstore,Steakhouse
8,M5J,Downtown Toronto,"Harbourfront East, Union Station, Toronto Islands",43.640816,-79.381752,2,Coffee Shop,Aquarium,Café,Hotel,Restaurant,Sporting Goods Shop,Brewery,Fried Chicken Joint,Scenic Lookout,Park
9,M5K,Downtown Toronto,"Toronto Dominion Centre, Design Exchange",43.647177,-79.381576,2,Coffee Shop,Hotel,Café,Restaurant,Salad Place,Seafood Restaurant,Japanese Restaurant,Italian Restaurant,American Restaurant,Beer Bar
10,M5L,Downtown Toronto,"Commerce Court, Victoria Hotel",43.648198,-79.379817,2,Coffee Shop,Restaurant,Café,Hotel,Gym,American Restaurant,Japanese Restaurant,Italian Restaurant,Seafood Restaurant,Cocktail Bar
11,M5S,Downtown Toronto,"University of Toronto, Harbord",43.662696,-79.400049,2,Café,Japanese Restaurant,Bookstore,Bar,Italian Restaurant,Bakery,Restaurant,Beer Bar,Beer Store,College Gym
12,M5T,Downtown Toronto,"Kensington Market, Chinatown, Grange Park",43.653206,-79.400049,2,Café,Vegetarian / Vegan Restaurant,Bakery,Mexican Restaurant,Vietnamese Restaurant,Coffee Shop,Grocery Store,Dessert Shop,Park,Pizza Place


In [40]:
#Cluster 3
dttor_merged[dttor_merged['Cluster Labels']==3]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,M6G,Downtown Toronto,Christie,43.669542,-79.422564,3,Grocery Store,Café,Park,Coffee Shop,Candy Store,Italian Restaurant,Diner,Restaurant,Athletics & Sports,Baby Store


In [41]:
#Cluster 4
dttor_merged[dttor_merged['Cluster Labels']==4]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,4,Coffee Shop,Diner,Sushi Restaurant,Discount Store,Smoothie Shop,Beer Bar,Italian Restaurant,Sculpture Garden,Sandwich Place,Distribution Center
5,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383,4,Coffee Shop,Sandwich Place,Italian Restaurant,Café,Japanese Restaurant,Salad Place,Thai Restaurant,Burger Joint,Bubble Tea Shop,Department Store
