<center><h1>Capstone Project:  Clustering Neighborhoods in Toronto.</h1></center>
<h2>Introduction</h2>
<p>In this notebook, we'll explore, segment, and cluster the neighborhoods of the city of Toronto in order to draw insights about its different areas. Some questions to keep in mind: How are the different places in this city grouped? How many main groups of neighborhoods are there, which are they, and which characteristics do they share?</p>
<p>As we go through the stages of this project, more questions will arise.</p>

<p>This notebook consists of five stages: </p>
<ul>
    <a href='#Data-collection-and-wrangling'><li>Data collection and wrangling.</li></a>
    <a href='#Geolocalization-stage.'><li>Geolocalization and clustering.</li></a>
    <a href='#Map-visualization-and-fetching-venues.'><li>Map visualization and fetching venues.</li></a>
    <a href='#Analizing-the-clusters'><li>Analyzing clusters.</li></a>
    <a href='#Conclusions'><li>Conclusions</li></a>
</ul>

<h2>Data collection and wrangling</h2>

<p>This stage will require us to: </p>
<p>First, extract the data. In our case study, the information about Toronto boroughs and neighborhoods isn't readily available in a CSV file or any other structured file. However, we'll use some libraries to scrape an HTML page and extract that data.</p>
<p>Second, once we have that data, we'll need to clean it, dropping the rows that don't help us and sorting it in a convenient way.</p>

In [359]:
#Install required packages to fetch html files and make calls to API's.
#!conda install -c conda-forge requests --yes 
# !conda install -c conda-forge folium --yes

In [360]:
#Import packages.
import pandas as pd
import numpy as np
import requests

import folium
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import json # To handle json files

from pandas import json_normalize # Convert json files into pandas dataframe (top-level import, pandas >= 1.0)

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans # For clustering

<p>To retrieve the data, we need to provide the website URL and pass it to the requests method <i>"get"</i>.</p>

In [361]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)

The GET request returns a response object; let's check the length of its content.

In [362]:
len(response.content)

79293

The response is long because it contains many HTML tags whose content we aren't interested in. Nevertheless, the pandas 'read_html' function can automatically convert the tables in the HTML into DataFrames, and we take the first one. I've preferred to convert the data this way, without having to install other packages.

In [363]:
df = pd.read_html(response.text)[0] # Read the first table contained in the response HTML.
df.columns = ['Postal Code', 'Borough', 'Neighborhood'] # Name the DataFrame columns.
df.drop([0], axis=0, inplace=True) # Drop the first row (axis=0 refers to rows).
df.reset_index(inplace=True, drop=True) # Reset the index without creating a new column called "index".
df.head() # See the first 5 rows.

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [364]:
# Check the data frame size.
df.shape 

(288, 3)

As we can see above, many rows have 'Not assigned' in both the Borough and Neighborhood columns; those rows are useless, so we'll get rid of them. We'll only drop the rows that lack a Borough value, though: a row that lacks only its Neighborhood can be recovered by copying its Borough value into it, but not vice versa.

In [365]:
useless_rows = df[df['Borough'] == 'Not assigned']
print('Useless rows: {}'.format(useless_rows.shape[0]))
useless_rows.head()

Useless rows: 77


Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
9,M8A,Not assigned,Not assigned
13,M2B,Not assigned,Not assigned
20,M7B,Not assigned,Not assigned


Now, let's see the number of rows that contain Borough values, i.e. the useful rows.

In [366]:
df = df[df['Borough'] != 'Not assigned']
print('Useful rows: {}'.format(df.shape[0]))
df.head()

Useful rows: 211


Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


Now, in order to recover the rows that have a Borough value but lack a Neighborhood value, we can replace the 'Not assigned' Neighborhood with the Borough value. We'll iterate through the dataset in order to do this.

In [367]:
index_neighs_to_replace_for_boroughs = df[(df['Borough'] != 'Not assigned') & (df['Neighborhood'] == 'Not assigned')].index.tolist()

for i in index_neighs_to_replace_for_boroughs:
    print('Neighborhood with this value: "'+ df.loc[i, 'Neighborhood'] + '" will be replaced with this: "' + df.loc[i, 'Borough'] + '" value')
    # Use .loc instead of chained indexing, which may fail to update df.
    df.loc[i, 'Neighborhood'] = df.loc[i, 'Borough']

Neighborhood with this value: "Not assigned" will be replaced with this: "Queen's Park" value
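
The same replacement can also be written as a single vectorized assignment. The following is a minimal sketch on a hypothetical three-row frame (`toy` and its values are illustrative, not the scraped table), assuming the same 'Not assigned' placeholder:

```python
import pandas as pd

# Hypothetical sample that mimics the columns of the scraped table.
toy = pd.DataFrame({
    'Postal Code': ['M5A', 'M7A', 'M9A'],
    'Borough': ['Downtown Toronto', "Queen's Park", 'Etobicoke'],
    'Neighborhood': ['Harbourfront', 'Not assigned', 'Islington Avenue'],
})

# Rows whose Neighborhood is missing but whose Borough is known.
mask = (toy['Neighborhood'] == 'Not assigned') & (toy['Borough'] != 'Not assigned')

# Copy the Borough into the Neighborhood for those rows only.
toy.loc[mask, 'Neighborhood'] = toy.loc[mask, 'Borough']

# Confirm that no placeholder remains in either column.
all_assigned = bool((toy[['Borough', 'Neighborhood']] != 'Not assigned').all().all())
print(toy)
print('All assigned:', all_assigned)
```

Using a boolean mask with `.loc` avoids the explicit loop and updates the frame in one step.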


In [368]:
# Check that the 'Borough' and 'Neighborhood' columns have no 'Not assigned' values.
# Each column is compared explicitly; a bare tuple would always be truthy.
if all(Borough != 'Not assigned' and Neighbourhood != 'Not assigned' for i, Borough, Neighbourhood in df[['Borough','Neighborhood']].itertuples()):
    print('All Borough and Neighbourhood have assigned values.')

All Borough and Neighbourhood have assigned values.


Now let's look at the data frame shape at this point.

In [369]:
df.shape

(211, 3)

In [370]:
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


Notice above that the 'Postal Code' column contains repeated postal codes. We can resolve this by grouping the corresponding Neighborhoods into a single comma-separated value, instead of having repeated rows.

In [371]:
df = df.astype(str).set_index(['Postal Code', 'Borough'])
df_merged = df.groupby(level=['Postal Code', 'Borough'], sort=False).agg(', '.join)
df_merged.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Neighborhood
Postal Code,Borough,Unnamed: 2_level_1
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,"Harbourfront, Regent Park"
M6A,North York,"Lawrence Heights, Lawrence Manor"
M7A,Queen's Park,Queen's Park
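
An alternative way to collapse duplicated postal codes, without first moving columns into the index, is sketched below on a hypothetical sample (`toy` and `toy_merged` are illustrative names). Passing `as_index=False` keeps the grouping keys as ordinary columns:

```python
import pandas as pd

# Hypothetical sample with repeated postal codes.
toy = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M6A', 'M6A', 'M7A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto',
                'North York', 'North York', "Queen's Park"],
    'Neighborhood': ['Harbourfront', 'Regent Park',
                     'Lawrence Heights', 'Lawrence Manor', "Queen's Park"],
})

# Join the neighborhoods of each (Postal Code, Borough) pair with commas.
toy_merged = (
    toy.groupby(['Postal Code', 'Borough'], sort=False, as_index=False)
       .agg({'Neighborhood': ', '.join})
)
print(toy_merged)
```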


Now our dataset has unique postal codes, and the Neighborhoods belonging to each one are separated by commas. However, we still need to reset the index to bring Postal Code and Borough back into the dataframe's columns.

In [372]:
df_merged.reset_index(inplace=True)
df_merged.head(12)

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens, Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson, Garden District"


Finally, let's look at its shape. It now has more than a hundred fewer rows, because we've grouped the Neighborhoods in order to avoid duplicated postal codes.

In [373]:
df_merged.shape

(103, 3)

<h2>Geolocalization stage.</h2>

So far, our data frame is clean, but in order to plot maps and analyze locations, we need the geographic coordinates for every postal code. We could obtain them through a geocoding package; however, for the sake of simplicity, we'll use a CSV file found online that contains the coordinates for each of our Postal Codes. If our case study required a greater variety of locations, we could use one of those packages instead.


In [374]:
#Import csv file
df_latlong = pd.read_csv('http://cocl.us/Geospatial_data')
df_latlong.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Now, in order to check that our location data lines up row by row with our dataframe, let's sort both by Postal Code.

In [375]:
df_latlong = df_latlong.sort_values(by='Postal Code').reset_index(drop=True) # Keep the sorted result
df_latlong.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [376]:
df_merged.sort_values(by='Postal Code', inplace=True)
df_merged.reset_index(inplace=True, drop=True)
df_merged.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


As you can see above, after sorting both dataframes by Postal Code their rows line up, so concatenating them column-wise gives each row its corresponding location.

In [377]:
df_places = pd.concat([df_merged, df_latlong[['Latitude', 'Longitude']]], axis=1)


In [378]:
df_places.head(12)

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
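
Concatenating side by side relies on both frames containing exactly the same postal codes in the same order. A key-based join removes that assumption. The sketch below uses hypothetical two-row frames (`left`, `right`, and `places` are illustrative names) and `validate='one_to_one'` so that a duplicated postal code raises an error instead of silently multiplying rows:

```python
import pandas as pd

# Hypothetical frames listed in different orders on purpose.
left = pd.DataFrame({
    'Postal Code': ['M1C', 'M1B'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighborhood': ['Highland Creek, Rouge Hill, Port Union', 'Rouge, Malvern'],
})
right = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# Match rows on the shared key rather than on their position.
places = left.merge(right, on='Postal Code', how='left', validate='one_to_one')
print(places)
```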


<h2>Map visualization and fetching venues.</h2>

First, let's make sure the packages required to visualize maps are installed.

In [379]:
# !conda install -c conda-forge folium --yes 
# !conda install -c conda-forge geopy --yes

Before making calls to the Foursquare API, let's first take a look at the Toronto map and plot on it the different neighborhoods of the city.

In [380]:
toronto_latlong = [43.7362379,-79.303433]
map_toronto = folium.Map(location=toronto_latlong, zoom_start=11.5)

# add markers to map
for lat, lng, label in zip(df_places['Latitude'], df_places['Longitude'], df_places['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Now that we've seen how the neighborhoods are distributed throughout the city, let's make a GET request to the Foursquare API in order to get the different venues there. First, I need to define my credentials; for obvious reasons, I'll hide that cell.

<h3>Fetching venues from Foursquare API</h3>

In [381]:
# The code was removed by Watson Studio for sharing.
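
The hidden cell presumably defines `CLIENT_ID`, `CLIENT_SECRET`, and `VERSION`, since later cells use those names. A minimal sketch of one way to define them without exposing secrets follows; the environment variable names and the version date are assumptions, not the values used in this notebook:

```python
import os

# Hypothetical environment variable names; the original cell is not shown.
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '')
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '')

# Assumed API version date in YYYYMMDD form.
VERSION = '20180605'

if not CLIENT_ID or not CLIENT_SECRET:
    print('Foursquare credentials are not set in the environment.')
```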

Now, let's find the most common Borough in our Toronto dataset, so that we can fetch the venues it contains.

In [382]:
df_merged[['Borough','Neighborhood']].groupby(['Borough']).count().idxmax().to_string()

'Neighborhood    North York'

Since North York is the most common borough, it is the one we'll explore. Let's create a data frame containing all the data that we need from it.

In [383]:
northyk_data = df_places[df_places['Borough'] == 'North York'].reset_index(drop=True)
northyk_data.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M2H,North York,Hillcrest Village,43.803762,-79.363452
1,M2J,North York,"Fairview, Henry Farm, Oriole",43.778517,-79.346556
2,M2K,North York,Bayview Village,43.786947,-79.385975
3,M2L,North York,"Silver Hills, York Mills",43.75749,-79.374714
4,M2M,North York,"Newtonbrook, Willowdale",43.789053,-79.408493


Through the geopy package, we can get the latitude and longitude of the North York borough specifically.

In [384]:
address = 'North York, ON'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of North York are {}, {}.'.format(latitude, longitude))

The geographical coordinates of North York are 43.7708175, -79.4132998.


Let's visualize the North York neighborhoods map:

In [385]:
map_northyk = folium.Map(location=[latitude, longitude], zoom_start=12)

for lat, lng, label in zip(northyk_data['Latitude'], northyk_data['Longitude'], northyk_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_northyk)  

map_northyk

#### Let's explore the first neighborhood in our dataframe.

In [386]:
northyk_data.loc[0, 'Neighborhood']

'Hillcrest Village'

In [387]:
neighborhood_latitude = northyk_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = northyk_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = northyk_data.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Hillcrest Village are 43.8037622, -79.3634517.


Now, let's get the top 100 venues that are in Hillcrest Village within a radius of 1 kilometer.                               

In [388]:
radius = 1000
LIMIT = 100

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
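
As an alternative to formatting the URL by hand, the query string can be built from a dictionary. The sketch below uses only the standard library; the credential strings and the version date are placeholders for illustration, and `explore_url` is a hypothetical name:

```python
from urllib.parse import urlencode

# Placeholder values for illustration only; real credentials are not shown.
params = {
    'client_id': 'YOUR_CLIENT_ID',
    'client_secret': 'YOUR_CLIENT_SECRET',
    'v': '20180605',                   # assumed version date (YYYYMMDD)
    'll': '43.8037622,-79.3634517',    # Hillcrest Village latitude,longitude
    'radius': 1000,
    'limit': 100,
}

# safe=',' keeps the comma in the ll parameter unescaped for readability.
explore_url = 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params, safe=',')
print(explore_url)
```

The same dictionary could instead be passed to `requests.get` through its `params` argument, which handles the encoding itself.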

Now let's make the GET request to the API and see the name of the first venue we receive.

In [389]:
results = requests.get(url).json()
results['response']['groups'][0]['items'][0]['venue']['name']

'고려삼계탕 Korean Ginseng Chicken Soup & Bibimbap'

In [390]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [391]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,고려삼계탕 Korean Ginseng Chicken Soup & Bibimbap,Korean Restaurant,43.798391,-79.369187
1,Tastee,Bakery,43.807722,-79.356798
2,Galati,Grocery Store,43.797831,-79.36941
3,Cummer Park,Park,43.799564,-79.371175
4,Tim Hortons,Coffee Shop,43.798945,-79.369644


In [392]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

22 venues were returned by Foursquare.


### Explore Neighborhoods in North York
#### Let's create a function to repeat the same process for all the neighborhoods in North York

In [395]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
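
The function above indexes `categories[0]` directly, which would raise an error for a venue that has no category. A hedged sketch of a more tolerant flattening step is shown below; `extract_venue_rows` and the sample payload are hypothetical, and only the nesting (`venue`, `location`, `categories`) is taken from the response structure used above:

```python
def extract_venue_rows(name, lat, lng, items):
    """Flatten Foursquare-style items into tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to None when the venue has no category.
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Hypothetical payload with the same nesting as the API items.
sample_items = [
    {'venue': {'name': 'Tastee',
               'location': {'lat': 43.807722, 'lng': -79.356798},
               'categories': [{'name': 'Bakery'}]}},
    {'venue': {'name': 'Uncategorized Spot',
               'location': {'lat': 43.80, 'lng': -79.36},
               'categories': []}},
]

rows = extract_venue_rows('Hillcrest Village', 43.803762, -79.363452, sample_items)
for row in rows:
    print(row)
```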

Now, let's use the above function to retrieve the nearest venues in North York.

In [396]:
northyk_venues = getNearbyVenues(names=northyk_data['Neighborhood'],
                                   latitudes=northyk_data['Latitude'],
                                   longitudes=northyk_data['Longitude']
                                  )

Let's take a look at the dataframe that contains the acquired venue data.

In [397]:
print(northyk_venues.shape)
northyk_venues.head()

(243, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Hillcrest Village,43.803762,-79.363452,Eagle's Nest Golf Club,43.805455,-79.364186,Golf Course
1,Hillcrest Village,43.803762,-79.363452,New York Fries,43.803664,-79.363905,Fast Food Restaurant
2,Hillcrest Village,43.803762,-79.363452,AY Jackson Pool,43.804515,-79.366138,Pool
3,Hillcrest Village,43.803762,-79.363452,Villa Madina,43.801685,-79.363938,Mediterranean Restaurant
4,Hillcrest Village,43.803762,-79.363452,Duncan Creek Park,43.805539,-79.360695,Dog Run


Now, let's see the number of venues that each neighborhood has: 

In [398]:
northyk_venues.groupby('Neighborhood')[['Venue']].count()

Unnamed: 0_level_0,Venue
Neighborhood,Unnamed: 1_level_1
"Bathurst Manor, Downsview North, Wilson Heights",19
Bayview Village,4
"Bedford Park, Lawrence Manor East",23
"CFB Toronto, Downsview East",3
Don Mills North,4
Downsview Central,4
Downsview Northwest,5
Downsview West,4
"Downsview, North Park, Upwood Park",4
"Emery, Humberlea",2


<strong>Which are the most common venue categories, i.e. which are the top 10 venue categories most commonly present in the North York Borough?</strong>

In [399]:
northyk_venues.groupby('Venue Category')[['Neighborhood']].count().sort_values(by='Neighborhood', axis=0, ascending=False).head(10)

Unnamed: 0_level_0,Neighborhood
Venue Category,Unnamed: 1_level_1
Coffee Shop,18
Fast Food Restaurant,12
Clothing Store,11
Restaurant,7
Japanese Restaurant,7
Pizza Place,6
Grocery Store,6
Café,5
Bank,5
Park,5


<b>From the table above, we can see that Coffee Shop is the most common venue category in this Borough. This could be helpful if, for example, we're looking for the safest business in a Borough. However, what if we want to innovate and start competing in a less explored venue category?</b>  <span>Let's see the least common Venue Categories.</span>

In [400]:
northyk_venues.groupby('Venue Category')[['Neighborhood']].count().sort_values(by='Neighborhood', axis=0, ascending=True).head(10)

Unnamed: 0_level_0,Neighborhood
Venue Category,Unnamed: 1_level_1
Accessories Store,1
Korean Restaurant,1
Indonesian Restaurant,1
Indian Restaurant,1
Ice Cream Shop,1
Hotel,1
Hockey Arena,1
Wings Joint,1
Golf Course,1
General Entertainment,1


From the result above, we can infer that if we want to start competing in an unexplored market in this Borough, some good options to consider are: Accessories Store, Ice Cream Shop, or General Entertainment.

<strong>How much variety does this Borough have in terms of venues?</strong>

In [401]:
print('There are {} unique categories.'.format(len(northyk_venues['Venue Category'].unique())))

There are 108 unique categories.


In [402]:
# one hot encoding
northyk_onehot = pd.get_dummies(northyk_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
northyk_onehot['Neighborhood'] = northyk_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [northyk_onehot.columns[-1]] + list(northyk_onehot.columns[:-1])
northyk_onehot = northyk_onehot[fixed_columns]

northyk_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Airport,American Restaurant,Arcade,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,Bank,...,Sushi Restaurant,Tea Room,Thai Restaurant,Theater,Toy / Game Store,Video Game Store,Video Store,Vietnamese Restaurant,Wings Joint,Women's Store
0,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Hillcrest Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Next, let's group rows by neighborhood and take the mean of the frequency of occurrence of each category

In [403]:
northyk_grouped = northyk_onehot.groupby('Neighborhood').mean().reset_index()
northyk_grouped.head()

Unnamed: 0,Neighborhood,Accessories Store,Airport,American Restaurant,Arcade,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,Bank,...,Sushi Restaurant,Tea Room,Thai Restaurant,Theater,Toy / Game Store,Video Game Store,Video Store,Vietnamese Restaurant,Wings Joint,Women's Store
0,"Bathurst Manor, Downsview North, Wilson Heights",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.052632,...,0.052632,0.0,0.0,0.0,0.0,0.0,0.052632,0.0,0.0,0.0
1,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bedford Park, Lawrence Manor East",0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,...,0.043478,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"CFB Toronto, Downsview East",0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Don Mills North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
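
To make the meaning of these averages concrete, here is a minimal sketch on hypothetical data (`toy_venues`, neighborhoods 'A' and 'B', and their venues are illustrative). Because each one-hot row contains a single 1, the per-neighborhood mean of a column is the share of that neighborhood's venues in that category:

```python
import pandas as pd

# Hypothetical venues: neighborhood A has 4 venues, B has 2.
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Park', 'Park', 'Café', 'Bank', 'Park', 'Bank'],
})

# One indicator column per category, as floats so the means are plain numbers.
onehot = pd.get_dummies(toy_venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', toy_venues['Neighborhood'])

# Mean of the indicators = relative frequency of each category.
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```

In this toy data, 'Park' accounts for half of the venues in both neighborhoods, and each row of frequencies sums to 1.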


Let's take a look at the new shape of our dataset:

In [404]:
northyk_grouped.shape

(23, 109)

### Let's print each neighborhood along with the top 5 most common venues.

In [405]:
num_top_venues = 5

for hood in northyk_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = northyk_grouped[northyk_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Bathurst Manor, Downsview North, Wilson Heights----
              venue  freq
0       Coffee Shop  0.11
1  Sushi Restaurant  0.05
2        Restaurant  0.05
3    Sandwich Place  0.05
4     Shopping Mall  0.05


----Bayview Village----
                 venue  freq
0   Chinese Restaurant  0.25
1                 Café  0.25
2  Japanese Restaurant  0.25
3                 Bank  0.25
4    Accessories Store  0.00


----Bedford Park, Lawrence Manor East----
                  venue  freq
0    Italian Restaurant  0.09
1  Fast Food Restaurant  0.09
2           Coffee Shop  0.09
3         Grocery Store  0.04
4               Butcher  0.04


----CFB Toronto, Downsview East----
         venue  freq
0      Airport  0.33
1         Park  0.33
2  Snack Place  0.33
3     Pharmacy  0.00
4    Pet Store  0.00


----Don Mills North----
                  venue  freq
0  Gym / Fitness Center  0.25
1  Caribbean Restaurant  0.25
2                  Café  0.25
3   Japanese Restaurant  0.25
4        Massage Studio 

### Let's put that into a *pandas* dataframe
First, let's write a function to sort the venues in descending order.

In [406]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
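
A small usage sketch of this function on a hypothetical row follows (`toy_row` and its frequencies are illustrative; the function is repeated so the snippet runs on its own):

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # Skip the first entry (the neighborhood name) and sort the frequencies.
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Hypothetical grouped row: a name followed by category frequencies.
toy_row = pd.Series({'Neighborhood': 'A', 'Bank': 0.3, 'Café': 0.2, 'Park': 0.5, 'Pool': 0.0})

top = return_most_common_venues(toy_row, 2)
print(list(top))
```

Two caveats are worth keeping in mind. Categories with equal frequencies may come back in either order, and a neighborhood with fewer venues than `num_top_venues` has the remaining positions filled with zero-frequency categories, as with Bayview Village and its four venues.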

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [407]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = northyk_grouped['Neighborhood']

for ind in np.arange(northyk_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(northyk_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Bathurst Manor, Downsview North, Wilson Heights",Coffee Shop,Fried Chicken Joint,Shopping Mall,Middle Eastern Restaurant,Frozen Yogurt Shop,Pet Store,Pharmacy,Pizza Place,Deli / Bodega,Bridal Shop
1,Bayview Village,Chinese Restaurant,Café,Bank,Japanese Restaurant,Electronics Store,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Deli / Bodega,Department Store
2,"Bedford Park, Lawrence Manor East",Fast Food Restaurant,Italian Restaurant,Coffee Shop,Greek Restaurant,Indian Restaurant,Café,Liquor Store,Butcher,Pharmacy,Pizza Place
3,"CFB Toronto, Downsview East",Snack Place,Airport,Park,Electronics Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Deli / Bodega,Department Store
4,Don Mills North,Caribbean Restaurant,Gym / Fitness Center,Café,Japanese Restaurant,Women's Store,Electronics Store,Construction & Landscaping,Cosmetics Shop,Deli / Bodega,Department Store
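
The suffix lookup used for the column names works for ten columns, but it would label the 21st column "21th" if `num_top_venues` were larger. A minimal sketch of a general helper is shown below; `ordinal` is a hypothetical function, not part of the notebook:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# Build the same column names as above for the top 10 venues.
columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns)
```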


## Clustering
Run *k*-means to cluster the neighborhoods into 2 clusters (chosen mainly because this dataset isn't as large as the New York City dataset).

In [408]:
# set number of clusters
kclusters = 2

northyk_grouped_clustering = northyk_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(northyk_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 1, 0, 1, 1, 1, 1, 0, 1], dtype=int32)
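
The number of clusters was fixed at 2 by judgment. One common way to sanity-check such a choice is to compare silhouette scores for a few values of *k*. The sketch below does this on synthetic data (two well-separated groups generated for illustration), not on the North York frequencies; it is an assumption about how the check could be done, not a step of this notebook:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: two tight groups of 20 points in 5 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.05, size=(20, 5)),
    rng.normal(1.0, 0.05, size=(20, 5)),
])

# Fit k-means for several k and record the mean silhouette score of each.
scores = {}
for k in range(2, 5):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print('Best k by silhouette score:', best_k)
```

A higher silhouette score indicates tighter, better separated clusters, so the value of *k* with the highest score is a reasonable candidate.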

Now that we have a cluster label for each row in our dataset, let's add the labels to our venues table.

In [409]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

northyk_merged = northyk_data

# join neighborhoods_venues_sorted onto northyk_data to add cluster labels and top venues to each neighborhood
northyk_merged = northyk_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

In [410]:
northyk_merged.head()  # check the last columns!

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M2H,North York,Hillcrest Village,43.803762,-79.363452,1.0,Golf Course,Pool,Athletics & Sports,Fast Food Restaurant,Mediterranean Restaurant,Dog Run,Women's Store,Discount Store,Comfort Food Restaurant,Construction & Landscaping
1,M2J,North York,"Fairview, Henry Farm, Oriole",43.778517,-79.346556,1.0,Clothing Store,Fast Food Restaurant,Coffee Shop,Bus Station,Asian Restaurant,Juice Bar,Restaurant,Jewelry Store,Japanese Restaurant,Bakery
2,M2K,North York,Bayview Village,43.786947,-79.385975,1.0,Chinese Restaurant,Café,Bank,Japanese Restaurant,Electronics Store,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Deli / Bodega,Department Store
3,M2L,North York,"Silver Hills, York Mills",43.75749,-79.374714,0.0,Park,Cafeteria,Dog Run,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Deli / Bodega,Department Store,Dessert Shop
4,M2M,North York,"Newtonbrook, Willowdale",43.789053,-79.408493,,,,,,,,,,,


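As you can see above, there's a row that contains NaN values. This neighborhood does not appear in northyk_grouped (which has only 23 rows), presumably because Foursquare returned no venues for it within the search radius, so the join left its cluster label and venue columns empty. To avoid numerical problems, let's make sure to drop the rows that contain this null value.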

In [411]:
northyk_merged[northyk_merged['Cluster Labels'].isnull()]

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,M2M,North York,"Newtonbrook, Willowdale",43.789053,-79.408493,,,,,,,,,,,
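
Rows like this one can also be located with a left join that records where each row came from. The sketch below uses hypothetical frames (`all_hoods`, `grouped`, and neighborhoods 'A' to 'C' are illustrative); `indicator=True` adds a `_merge` column marking rows that had no match:

```python
import pandas as pd

# Hypothetical data: neighborhood B has no venues, so it is absent from grouped.
all_hoods = pd.DataFrame({'Neighborhood': ['A', 'B', 'C']})
grouped = pd.DataFrame({'Neighborhood': ['A', 'C'], 'Cluster Labels': [1, 0]})

# Left join; '_merge' is 'left_only' for rows without a match in grouped.
merged = all_hoods.merge(grouped, on='Neighborhood', how='left', indicator=True)
missing = merged.loc[merged['_merge'] == 'left_only', 'Neighborhood'].tolist()
print('Neighborhoods without venues:', missing)

# Keep matched rows and restore integer labels (NaN had forced them to float).
clean = merged[merged['_merge'] == 'both'].drop(columns='_merge').copy()
clean['Cluster Labels'] = clean['Cluster Labels'].astype(int)
print(clean)
```

Casting the labels back to integers after dropping the unmatched rows also avoids values such as `1.0` in later tables and popups.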


Now we're sure that only the row with index 4 has this problem, so let's get rid of it and reset the index.

In [412]:
northyk_merged.drop(northyk_merged[northyk_merged['Cluster Labels'].isnull()].index, inplace=True)

In [413]:
northyk_merged.reset_index(inplace=True, drop=True)
northyk_merged.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M2H,North York,Hillcrest Village,43.803762,-79.363452,1.0,Golf Course,Pool,Athletics & Sports,Fast Food Restaurant,Mediterranean Restaurant,Dog Run,Women's Store,Discount Store,Comfort Food Restaurant,Construction & Landscaping
1,M2J,North York,"Fairview, Henry Farm, Oriole",43.778517,-79.346556,1.0,Clothing Store,Fast Food Restaurant,Coffee Shop,Bus Station,Asian Restaurant,Juice Bar,Restaurant,Jewelry Store,Japanese Restaurant,Bakery
2,M2K,North York,Bayview Village,43.786947,-79.385975,1.0,Chinese Restaurant,Café,Bank,Japanese Restaurant,Electronics Store,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Deli / Bodega,Department Store
3,M2L,North York,"Silver Hills, York Mills",43.75749,-79.374714,0.0,Park,Cafeteria,Dog Run,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Deli / Bodega,Department Store,Dessert Shop
4,M2N,North York,Willowdale South,43.77012,-79.408493,1.0,Ramen Restaurant,Coffee Shop,Sandwich Place,Sushi Restaurant,Restaurant,Café,Middle Eastern Restaurant,Fast Food Restaurant,Indonesian Restaurant,Japanese Restaurant


Now, let's see how many Neighborhoods the algorithm has assigned to each cluster:

In [414]:
northyk_merged.groupby('Cluster Labels')[['Neighborhood']].count()

Unnamed: 0_level_0,Neighborhood
Cluster Labels,Unnamed: 1_level_1
0.0,5
1.0,18


Now, it's time to visualize it on the map:

In [415]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(northyk_merged['Latitude'], northyk_merged['Longitude'], northyk_merged['Neighborhood'], northyk_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
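
The marker loop above looks up each color with `rainbow[int(cluster)-1]`, so label 0 wraps around to the last color in the list. The float labels (`0.0`, `1.0`) also suggest that the merge may have produced missing values at some point, in which case `int(cluster)` would fail. As a minimal sketch, assuming a hypothetical helper `color_for` and a hypothetical two-color palette (neither is part of the notebook), the lookup could map labels directly and fall back to gray for missing ones:

```python
import math

def color_for(label, palette, fallback="#808080"):
    """Return the palette color for a cluster label, or `fallback` if the label is missing."""
    if label is None or (isinstance(label, float) and math.isnan(label)):
        return fallback
    # Map the label directly (0 -> first color) instead of relying on negative indexing.
    return palette[int(label) % len(palette)]

palette = ["#8000ff", "#ff0000"]  # hypothetical stand-in for the `rainbow` list

print(color_for(0.0, palette))
print(color_for(1.0, palette))
print(color_for(float("nan"), palette))
```

With this mapping the colors are assigned in label order, which differs from the `-1` offset used in the cell above; either choice works as long as it is applied consistently to `color` and `fill_color`.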

## Analyzing the clusters:
#### First cluster:

In [416]:
clus0 = northyk_merged.loc[northyk_merged['Cluster Labels'] == 0, northyk_merged.columns[[2] + list(range(5, northyk_merged.shape[1]))]]
clus0

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,"Silver Hills, York Mills",0.0,Park,Cafeteria,Dog Run,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Deli / Bodega,Department Store,Dessert Shop
5,York Mills West,0.0,Park,Electronics Store,Bank,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Deli / Bodega,Department Store,Dessert Shop
7,Parkwoods,0.0,Park,Pool,Food & Drink Shop,Fast Food Restaurant,Discount Store,Clothing Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop
12,"CFB Toronto, Downsview East",0.0,Snack Place,Airport,Park,Electronics Store,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Deli / Bodega,Department Store
20,"Downsview, North Park, Upwood Park",0.0,Park,Construction & Landscaping,Bakery,Massage Studio,Dog Run,Coffee Shop,Comfort Food Restaurant,Cosmetics Shop,Deli / Bodega,Department Store


#### Second cluster:

In [417]:
clus1 = northyk_merged.loc[northyk_merged['Cluster Labels'] == 1, northyk_merged.columns[[2] + list(range(5, northyk_merged.shape[1]))]]
clus1

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Hillcrest Village,1.0,Golf Course,Pool,Athletics & Sports,Fast Food Restaurant,Mediterranean Restaurant,Dog Run,Women's Store,Discount Store,Comfort Food Restaurant,Construction & Landscaping
1,"Fairview, Henry Farm, Oriole",1.0,Clothing Store,Fast Food Restaurant,Coffee Shop,Bus Station,Asian Restaurant,Juice Bar,Restaurant,Jewelry Store,Japanese Restaurant,Bakery
2,Bayview Village,1.0,Chinese Restaurant,Café,Bank,Japanese Restaurant,Electronics Store,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Deli / Bodega,Department Store
4,Willowdale South,1.0,Ramen Restaurant,Coffee Shop,Sandwich Place,Sushi Restaurant,Restaurant,Café,Middle Eastern Restaurant,Fast Food Restaurant,Indonesian Restaurant,Japanese Restaurant
6,Willowdale West,1.0,Pharmacy,Grocery Store,Pizza Place,Coffee Shop,Butcher,Discount Store,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Deli / Bodega
8,Don Mills North,1.0,Caribbean Restaurant,Gym / Fitness Center,Café,Japanese Restaurant,Women's Store,Electronics Store,Construction & Landscaping,Cosmetics Shop,Deli / Bodega,Department Store
9,"Flemingdon Park, Don Mills South",1.0,Coffee Shop,Asian Restaurant,Gym,Beer Store,Bike Shop,Clothing Store,Chinese Restaurant,Dim Sum Restaurant,Discount Store,Restaurant
10,"Bathurst Manor, Downsview North, Wilson Heights",1.0,Coffee Shop,Fried Chicken Joint,Shopping Mall,Middle Eastern Restaurant,Frozen Yogurt Shop,Pet Store,Pharmacy,Pizza Place,Deli / Bodega,Bridal Shop
11,"Northwood Park, York University",1.0,Coffee Shop,Furniture / Home Store,Bar,Falafel Restaurant,Massage Studio,Dog Run,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Deli / Bodega
13,Downsview West,1.0,Moving Target,Grocery Store,Bank,Shopping Mall,Dog Run,Coffee Shop,Comfort Food Restaurant,Construction & Landscaping,Cosmetics Shop,Deli / Bodega
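
Both selections above pick columns by position (`[2] + list(range(5, ...))`), which silently breaks if the column order of `northyk_merged` changes. A possible alternative, sketched here on a two-row toy frame built from rows shown earlier, selects the same columns by name. The helper name `cluster_view` is an assumption made for illustration and does not appear in the notebook:

```python
import pandas as pd

# Toy frame with the same column layout as northyk_merged (only one venue column kept).
df = pd.DataFrame({
    "Postal Code": ["M2H", "M2L"],
    "Borough": ["North York", "North York"],
    "Neighborhood": ["Hillcrest Village", "Silver Hills, York Mills"],
    "Latitude": [43.803762, 43.75749],
    "Longitude": [-79.363452, -79.374714],
    "Cluster Labels": [1.0, 0.0],
    "1st Most Common Venue": ["Golf Course", "Park"],
})

def cluster_view(frame, label):
    """Rows of one cluster, keeping the neighborhood, its label, and the venue columns."""
    venue_cols = [c for c in frame.columns if c.endswith("Most Common Venue")]
    keep = ["Neighborhood", "Cluster Labels"] + venue_cols
    return frame.loc[frame["Cluster Labels"] == label, keep]

print(cluster_view(df, 0))
```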


Let's find which venue ranks first in the largest number of neighborhoods. In other words: <br />
<b>how many neighborhoods have each venue as their most common venue?</b><br />
In the first cluster, 4 of the 5 neighborhoods have a park as their most common venue.

In [418]:
clus0['1st Most Common Venue'].value_counts()

Park           4
Snack Place    1
Name: 1st Most Common Venue, dtype: int64
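
Because the two clusters differ in size (5 and 18 neighborhoods), raw counts are hard to compare across them. As a small sketch, the same tally can be expressed as a share of the cluster with `value_counts(normalize=True)`. The series below is typed in by hand from the first-cluster table above, and the variable names are illustrative only:

```python
import pandas as pd

# Top venue of each first-cluster neighborhood, copied from the table above.
first_venue = pd.Series(
    ["Park", "Park", "Park", "Snack Place", "Park"],
    name="1st Most Common Venue",
)

counts = first_venue.value_counts()                # absolute counts
shares = first_venue.value_counts(normalize=True)  # fraction of neighborhoods

summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```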

Now, let's do the same for our second cluster. Here, 3 neighborhoods have a coffee shop as their most common venue. <br />
Notice, however, that the broadest category of top venues in this cluster is restaurants.

In [419]:
clus1['1st Most Common Venue'].value_counts()

Coffee Shop                   3
Pizza Place                   2
Clothing Store                2
Liquor Store                  1
Ramen Restaurant              1
Pharmacy                      1
Fast Food Restaurant          1
Korean Restaurant             1
Chinese Restaurant            1
Arcade                        1
Construction & Landscaping    1
Moving Target                 1
Caribbean Restaurant          1
Golf Course                   1
Name: 1st Most Common Venue, dtype: int64

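Now, let's repeat this process for the n-th most common venue. In our first cluster, a cosmetics shop is the venue that appears most often in both the seventh and eighth positions. Keep in mind that `idxmax` returns only the first label when counts are tied, so a position where every neighborhood has a different venue (such as the second) still reports a single name.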

In [420]:
for nth_most_common_venue in clus0.columns.tolist()[2:]:
    print(nth_most_common_venue + ": " + clus0.groupby(nth_most_common_venue)[['Neighborhood']].count().idxmax().values)

['1st Most Common Venue: Park']
['2nd Most Common Venue: Airport']
['3rd Most Common Venue: Bakery']
['4th Most Common Venue: Coffee Shop']
['5th Most Common Venue: Comfort Food Restaurant']
['6th Most Common Venue: Construction & Landscaping']
['7th Most Common Venue: Cosmetics Shop']
['8th Most Common Venue: Cosmetics Shop']
['9th Most Common Venue: Deli / Bodega']
['10th Most Common Venue: Department Store']
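
The loop above reports one venue per position, but it hides how many neighborhoods actually share that venue and whether several venues are tied. A tie-aware variant might look like the sketch below. It uses two columns copied from the first-cluster table, and the helper `top_venues` is a hypothetical name introduced here, not something the notebook defines:

```python
import pandas as pd

# 2nd and 7th most common venues of the five first-cluster neighborhoods shown above.
clus0_sample = pd.DataFrame({
    "2nd Most Common Venue": [
        "Cafeteria", "Electronics Store", "Pool", "Airport",
        "Construction & Landscaping",
    ],
    "7th Most Common Venue": [
        "Cosmetics Shop", "Cosmetics Shop", "Coffee Shop",
        "Construction & Landscaping", "Comfort Food Restaurant",
    ],
})

def top_venues(column):
    """Return every venue tied for the highest count, plus that count."""
    counts = column.value_counts()
    best = counts.max()
    return sorted(counts[counts == best].index), int(best)

for name in clus0_sample.columns:
    venues, n = top_venues(clus0_sample[name])
    print(f"{name}: {venues} ({n} of {len(clus0_sample)} neighborhoods)")
```

For the second position all five venues are tied with one neighborhood each, which is why the original loop prints "Airport": it is simply the first label in alphabetical order.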


In our second cluster, Comfort Food Restaurant is the venue that appears most often in both the sixth and seventh positions.

In [421]:
for nth_most_common_venue in clus1.columns.tolist()[2:]:
    print( nth_most_common_venue + ": " + clus1.groupby(nth_most_common_venue)[['Neighborhood']].count().idxmax().values)

['1st Most Common Venue: Coffee Shop']
['2nd Most Common Venue: Grocery Store']
['3rd Most Common Venue: Bank']
['4th Most Common Venue: Japanese Restaurant']
["5th Most Common Venue: Women's Store"]
['6th Most Common Venue: Comfort Food Restaurant']
['7th Most Common Venue: Comfort Food Restaurant']
['8th Most Common Venue: Construction & Landscaping']
['9th Most Common Venue: Cosmetics Shop']
['10th Most Common Venue: Deli / Bodega']


## Conclusions

We can then classify these two clusters of the North York borough within Toronto based on their characteristics:
<ul>
    <li>First cluster: Here we found neighborhoods contiguous to the main avenue that runs through the center of the North York borough. Its most distinctive aspect is that most of its neighborhoods have parks as their most common venue. In addition, an airport is situated right at the center of the zone.</li>
    <li>Second cluster: This cluster can be regarded as the one formed by the surrounding neighborhoods, which have more common venues, mainly coffee shops and many restaurants of varied cuisine.</li>
</ul>

In [422]:
map_clusters