# Coursera -- IBM Applied Data Science Capstone Project Week 3


This notebook is used mainly for the week 3 project of the IBM Applied Data Science Capstone course.

#### * Please note: The interactive features of this notebook, such as folium maps, will not render in my repository on GitHub. To view my Jupyter notebook with JavaScript content rendered, please [click to view my notebook in nbviewer.](https://nbviewer.jupyter.org/github/J-C-Zh/Coursera-IBM-Data-Science-Professional-9-Applied-Data_Science_Capstone/blob/master/Coursera%20IBM%20Applied%20Data%20Science%20Capstone%20week%203.ipynb)

<h1>Table of contents</h1>

1. Scrape the Toronto postal code dataframe from the Wikipedia page. The dataframe should have three columns: 'Postal Code', 'Borough' and 'Neighbourhood'.
   * Transform the table into a _pandas_ dataframe.
   * Ignore cells with a borough that is Not assigned.
   * For rows that have different neighbourhood names but the same postal code, combine them into one row with the neighbourhoods separated by a comma.
   * If a cell has a borough but a Not assigned neighbourhood, then the neighbourhood will be the same as the borough.

2. Get the latitude and longitude coordinates of each neighbourhood, add 'Latitude' and 'Longitude' columns to the Toronto postal code dataframe, and visualize the data.

3. Explore and cluster the neighbourhoods in Toronto. I will work only with boroughs that contain the word 'Toronto'.

## 1. Scrape the Toronto postal code dataframe.

In [1]:
import pandas as pd
from pandas import DataFrame

In [2]:
df_list = pd.read_html('http://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M',header=0)

df_list is a list that contains all the tables from the 'List of postal codes of Canada: M' webpage. We only need the first table, which is df_list[0].

In [3]:
df = df_list[0]
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


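Change the column names per project requirements. The table shown above already uses the required names, so the rename below has no effect here; it only matters if the source table labels the borough column 'Community'.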

In [4]:
df.rename(columns={"Community": "Borough"}, inplace = True)

Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [5]:
df = df.loc[~(df['Borough'] == 'Not assigned')].reset_index(drop=True)
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


Check whether any postal code appears in more than one row, which would mean its neighbourhoods need to be combined.

In [6]:
len(df['Postal Code'].unique())

103

There are 103 unique postal codes and the df dataframe has 103 rows, so no postal code is duplicated in our dataframe. Next, let's check whether any neighbourhood is 'Not assigned'.

In [7]:
df.loc[(df['Neighbourhood'] == 'Not assigned')].shape


(0, 3)

There is no 'Not assigned' neighbourhood. The shape of our cleaned dataframe is:

In [8]:
df.shape

(103, 3)

In [9]:
print('The dataframe has {} boroughs'.format(
        len(df['Borough'].unique())
    )
)

The dataframe has 10 boroughs
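
The scraped table happened to need neither the "combine rows" rule nor the "neighbourhood falls back to borough" rule. For a table that does need them, the three cleaning rules could be applied roughly as follows. This is a minimal sketch on invented sample rows (the dataframe raw and its contents are made up for illustration), not the code used in this notebook.

```python
import pandas as pd

# Invented sample rows that exercise all three cleaning rules.
raw = pd.DataFrame({
    'Postal Code': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Regent Park', 'Harbourfront', 'Not assigned'],
})

# Rule 1: drop rows whose borough is 'Not assigned'.
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# Rule 3: a 'Not assigned' neighbourhood takes the borough's name.
missing = clean['Neighbourhood'] == 'Not assigned'
clean.loc[missing, 'Neighbourhood'] = clean.loc[missing, 'Borough']

# Rule 2: one row per postal code, neighbourhoods joined with a comma.
clean = (clean.groupby(['Postal Code', 'Borough'])['Neighbourhood']
              .agg(', '.join)
              .reset_index())
print(clean)
```

The fallback rule is applied before grouping so that a replaced neighbourhood name takes part in the comma-joined result.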


## 2. Get coordinates for each neighbourhood

In order to utilize the Foursquare location data, we need to get the latitude and longitude coordinates of each neighbourhood. There are two ways to get the geographical coordinates of the neighbourhoods:

1. Using the Geocoder Python package
2. Using the csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

I will use both methods to get the data.

### 1. Using the Geocoder Python package

In [10]:
!pip install geocoder
print('geocoder installed!')

Collecting geocoder
  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
Collecting ratelim (from geocoder)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6
geocoder installed!


Let's create an empty dataframe called df_cor_geocoder. I will use this new dataframe to store 'Postal Code', 'latitude' and 'longitude' data.

In [11]:
# define the dataframe columns
column_names = ['Postal Code', 'Latitude', 'Longitude'] 

# instantiate the dataframe
df_cor_geocoder = pd.DataFrame(columns=column_names)


I will loop over the postal codes, retrying each request in a while loop until I get the coordinates.

In [12]:
import geocoder # import geocoder

for code in df['Postal Code']:
    # initialize the variable to None
    lat_lng_coords = None
    # loop until I get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(code))
        lat_lng_coords = g.latlng

    latitude=lat_lng_coords[0]
    longitude=lat_lng_coords[1]
    df_cor_geocoder = df_cor_geocoder.append({'Postal Code': code,
                                          'Latitude': latitude,
                                          'Longitude': longitude}, ignore_index=True)
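
The inner while loop above never exits if the geocoding service keeps returning no coordinates for a postal code. A bounded retry is one way to guard against that. The sketch below assumes a hypothetical helper geocode_with_retry and a stand-in lookup function fake_lookup; neither is part of this notebook or of the geocoder package. In the notebook, the lookup would wrap the geocoder.arcgis call and return g.latlng.

```python
import time

def geocode_with_retry(lookup, query, max_tries=5, pause=0.0):
    """Call lookup(query) until it returns coordinates or max_tries is reached.

    lookup is any callable returning [lat, lng] or None (hypothetical interface).
    """
    for _ in range(max_tries):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(pause)  # wait before the next attempt
    return None  # give up instead of looping forever

# Stand-in for a flaky service: fails twice, then succeeds (invented behaviour).
calls = {'n': 0}

def fake_lookup(query):
    calls['n'] += 1
    return None if calls['n'] < 3 else [43.81153, -79.19552]

rows = []
coords = geocode_with_retry(fake_lookup, 'M1B, Toronto, Ontario')
if coords is not None:
    rows.append({'Postal Code': 'M1B', 'Latitude': coords[0], 'Longitude': coords[1]})
print(rows)
```

Collecting the rows in a list and building the dataframe once with pd.DataFrame(rows) would also avoid the repeated DataFrame.append calls, a method that was removed in pandas 2.0.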


In [13]:
df_cor_geocoder = df_cor_geocoder.sort_values(by=['Postal Code']).reset_index(drop=True)
df_cor_geocoder.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.81153,-79.19552
1,M1C,43.78564,-79.15871
2,M1E,43.76575,-79.1752
3,M1G,43.7682,-79.21761
4,M1H,43.76969,-79.23944


In [14]:
df_cor_geocoder.shape

(103, 3)

### 2. Using the csv data
Read the csv data from the provided link and store the 'Postal Code', 'Latitude' and 'Longitude' data in a new dataframe called df_cor_csv.

In [15]:
df_cor_csv = pd.read_csv('http://cocl.us/Geospatial_data')
df_cor_csv = df_cor_csv.sort_values(by=['Postal Code'])
df_cor_csv.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [16]:
df_cor_csv.shape

(103, 3)

### 3. Compare two coordinate dataframes

df_cor_geocoder and df_cor_csv have the same shape, and both provide coordinate data. However, their first five rows show slightly different values. I can confirm that they differ by using the _.equals_ function.

In [17]:
df_cor_geocoder.equals(df_cor_csv)

False

The above code returned False, which means the two dataframes hold different data. Now I will check how big the difference is. First, let me create a new dataframe called df_cor_diff. I will calculate the difference in latitude and longitude for each postal code and store it in the new dataframe.

In [18]:
df_cor_diff = pd.DataFrame(columns=['Postal Code','Latitude_diff','Longitude_diff'])
df_cor_diff.head()

Unnamed: 0,Postal Code,Latitude_diff,Longitude_diff


In [19]:
for i in range(0,df_cor_csv.shape[0]):
    df_cor_diff = df_cor_diff.append({'Postal Code': df_cor_csv['Postal Code'][i],
                                    'Latitude_diff': round((df_cor_geocoder['Latitude'][i]-df_cor_csv['Latitude'][i]),3),
                                    'Longitude_diff': round((df_cor_geocoder['Longitude'][i]-df_cor_csv['Longitude'][i]),3)}, ignore_index=True)

df_cor_diff.head()

Unnamed: 0,Postal Code,Latitude_diff,Longitude_diff
0,M1B,0.005,-0.001
1,M1C,0.001,0.002
2,M1E,0.002,0.014
3,M1G,-0.003,-0.001
4,M1H,-0.003,0.0
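
The loop above pairs rows by position, which works only because both dataframes are sorted by postal code and hold the same codes. Merging on 'Postal Code' makes the pairing explicit. A minimal sketch, using the first three rows of each dataframe shown earlier as sample data; the names geo, csv, both and diff are illustrative and not part of the notebook.

```python
import pandas as pd

geo = pd.DataFrame({'Postal Code': ['M1B', 'M1C', 'M1E'],
                    'Latitude': [43.81153, 43.78564, 43.76575],
                    'Longitude': [-79.19552, -79.15871, -79.1752]})
csv = pd.DataFrame({'Postal Code': ['M1E', 'M1B', 'M1C'],   # deliberately unsorted
                    'Latitude': [43.763573, 43.806686, 43.784535],
                    'Longitude': [-79.188711, -79.194353, -79.160497]})

# Pair rows by postal code instead of by position.
both = geo.merge(csv, on='Postal Code', suffixes=('_geo', '_csv'))

# Vectorized differences, rounded to three decimals as in the loop above.
diff = pd.DataFrame({
    'Postal Code': both['Postal Code'],
    'Latitude_diff': (both['Latitude_geo'] - both['Latitude_csv']).round(3),
    'Longitude_diff': (both['Longitude_geo'] - both['Longitude_csv']).round(3),
})
print(diff)
```

Because the merge matches on the key, the result does not depend on either dataframe being sorted first.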


In [20]:
# take the absolute value before the maximum so large negative differences are not missed
max_lat_diff = abs(df_cor_diff['Latitude_diff']).max()
max_lng_diff = abs(df_cor_diff['Longitude_diff']).max()
print('The biggest latitude difference is {}, and the biggest longitude difference is {}\n'
      .format(max_lat_diff, max_lng_diff))
print('The row with biggest latitude difference:')
print(df_cor_diff.loc[abs(df_cor_diff['Latitude_diff'])==max_lat_diff])
print('\n')

print('The row with biggest longitude difference:')
print(df_cor_diff.loc[abs(df_cor_diff['Longitude_diff'])==max_lng_diff])
print('\n')

The biggest latitude difference is 0.012, and the biggest longitude difference is 0.23

The row with biggest latitude difference:
   Postal Code  Latitude_diff  Longitude_diff
68         M5V          0.012          -0.005
86         M7R          0.012           0.230


The row with biggest longitude difference:
   Postal Code  Latitude_diff  Longitude_diff
86         M7R          0.012            0.23




The biggest longitude difference is large compared to the biggest latitude difference. Both maxima occur for postal code 'M7R' (postal code 'M5V' ties for the latitude maximum). Let's look more closely.

In [21]:
print('Data from geocoder method for postal code \'M7R\'')
print(df_cor_geocoder.loc[df_cor_geocoder['Postal Code']=='M7R'])
print('\n')
print('Data from csv file for postal code \'M7R\'')
print(df_cor_csv.loc[df_cor_csv['Postal Code']=='M7R'])

Data from geocoder method for postal code 'M7R'
   Postal Code  Latitude  Longitude
86         M7R  43.64869  -79.38544


Data from csv file for postal code 'M7R'
   Postal Code   Latitude  Longitude
86         M7R  43.636966 -79.615819


Using an [online Latitude/Longitude Calculator](https://www.nhc.noaa.gov/gccalc.shtml), I found that the distance between (43.64869, -79.38544) and (43.636966, -79.615819) is around 19 km. That is a large gap: for comparison, if we use the geocoder data to calculate the distance between two postal codes, say 'M1B' and 'M1C', the distance is only about 4 km. Although the two coordinate datasets are not identical, the difference does not change how I visualize and cluster the data. I will therefore use the geocoder dataframe and merge it with the earlier df dataframe to create a new dataframe that has all the information needed for the map.
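
The same distances can be checked in code rather than with an online calculator. The sketch below uses the haversine formula with an assumed mean Earth radius of 6371 km; the function name haversine_km is illustrative, and the online tool may use a slightly different Earth model, so small deviations from its figures are expected.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

# M7R: geocoder coordinates vs. csv coordinates
m7r = haversine_km(43.64869, -79.38544, 43.636966, -79.615819)
# M1B vs. M1C, both from the geocoder data
m1b_m1c = haversine_km(43.81153, -79.19552, 43.78564, -79.15871)
print(round(m7r, 1), round(m1b_m1c, 1))
```

Both results should land close to the figures quoted above.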

In [22]:
# substitute df_cor_geocoder with df_cor_csv if you want to use csv data 
df_cor = df_cor_geocoder

### 4. Data Visualization

Merge df and df_cor as a new dataframe 

In [23]:
df_all = pd.merge(df,df_cor,on ='Postal Code')
df_all

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.75188,-79.33036
1,M4A,North York,Victoria Village,43.73042,-79.31282
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65514,-79.36265
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.72321,-79.45141
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66449,-79.39302
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.66277,-79.52831
6,M1B,Scarborough,"Malvern, Rouge",43.81153,-79.19552
7,M3B,North York,Don Mills,43.74929,-79.36169
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.70794,-79.31160
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.65736,-79.37818


In [24]:
!pip install "folium==0.11.0"
import folium

print('Folium installed and imported!')

Collecting folium==0.11.0
  Downloading https://files.pythonhosted.org/packages/a4/f0/44e69d50519880287cc41e7c8a6acc58daa9a9acf5f6afc52bcc70f69a6d/folium-0.11.0-py2.py3-none-any.whl (93kB)
Collecting branca>=0.3.0 (from folium==0.11.0)
  Downloading https://files.pythonhosted.org/packages/13/fb/9eacc24ba3216510c6b59a4ea1cd53d87f25ba76237d7f4393abeaf4c94e/branca-0.4.1-py3-none-any.whl
Installing collected packages: branca, folium
Successfully installed branca-0.4.1 folium-0.11.0
Folium installed and imported!


In [25]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

In [26]:
address = 'Toronto,Canada'

geolocator = Nominatim(user_agent="Toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [27]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(df_all['Latitude'], df_all['Longitude'], df_all['Borough'], df_all['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

## 3. Explore and cluster the neighbourhoods in Toronto
Let's simplify the above map and segment and cluster only the boroughs that contain the word Toronto. To do so, let's slice the original dataframe and create a new dataframe that holds only the rows for those boroughs.

In [28]:
name_data = df_all[df_all['Borough'].str.contains('Toronto')].reset_index(drop=True)
name_data.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65514,-79.36265
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66449,-79.39302
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.65736,-79.37818
3,M5C,Downtown Toronto,St. James Town,43.65143,-79.37557
4,M4E,East Toronto,The Beaches,43.67703,-79.29542


In [29]:
print('There are {} postal codes for boroughs that contain the word \'Toronto\' in their names.'.format(name_data.shape[0]))

There are 39 postal codes for boroughs that contain the word 'Toronto' in their names.


In [30]:
# create map of boroughs that contain the word 'Toronto'
map_name = folium.Map(location=[latitude+0.02, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighbourhood in zip(name_data['Latitude'], name_data['Longitude'], name_data['Borough'], name_data['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_name)  
    
map_name

Next, we are going to start utilizing the Foursquare API to explore the neighbourhoods and segment them.

In [31]:
#hidden cell codes 
 
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 
CLIENT_SECRET:


In [32]:
# The code was removed by Watson Studio for sharing.

Let's create a function to get top 100 venues within a radius of 500 meters of all the neighbourhoods in name_data

In [33]:
import requests

radius = 500
LIMIT = 100

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
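
The list comprehension in getNearbyVenues assumes every returned item carries at least one category; an item with an empty categories list would raise an IndexError. A defensive variant of the extraction step could look like this. It is a sketch only: extract_venue_rows is a hypothetical helper, and sample_items is invented data shaped like the ['response']['groups'][0]['items'] list that the function reads.

```python
def extract_venue_rows(name, lat, lng, items):
    """Turn explore-style items into flat tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # fall back to None when a venue has no category
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

sample_items = [
    {'venue': {'name': 'Tandem Coffee',
               'location': {'lat': 43.653559, 'lng': -79.361809},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Unnamed Spot',          # invented venue with no category
               'location': {'lat': 43.65, 'lng': -79.36},
               'categories': []}},
]
rows = extract_venue_rows('Regent Park, Harbourfront', 43.65514, -79.36265, sample_items)
print(rows)
```

Rows with a None category can then be dropped or kept before the one-hot encoding step, depending on how uncategorized venues should be treated.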

Now write the code to run the above function on each neighbourhood and create a new dataframe called *name_venues*.

In [34]:
name_venues = getNearbyVenues(names=name_data['Neighbourhood'],
                                   latitudes=name_data['Latitude'],
                                   longitudes=name_data['Longitude']
                                  )
name_venues.shape


Regent Park, Harbourfront
Queen's Park, Ontario Provincial Government
Garden District, Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
The Danforth West, Riverdale
Toronto Dominion Centre, Design Exchange
Brockton, Parkdale Village, Exhibition Place
India Bazaar, The Beaches West
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West, Forest Hill Road Park
High Park, The Junction South
North Toronto West, Lawrence Park
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Davisville
University of Toronto, Harbord
Runnymede, Swansea
Moore Park, Summerhill East
Kensington Market, Chinatown, Grange Park
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
R

(1761, 7)

Let's check the size of the resulting dataframe.

In [35]:
print(name_venues.shape)
name_venues.head()

(1761, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65514,-79.36265,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.65514,-79.36265,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.65514,-79.36265,Figs Breakfast & Lunch,43.655675,-79.364503,Breakfast Spot
3,"Regent Park, Harbourfront",43.65514,-79.36265,The Yoga Lounge,43.655515,-79.364955,Yoga Studio
4,"Regent Park, Harbourfront",43.65514,-79.36265,Body Blitz Spa East,43.654735,-79.359874,Spa


Let's check how many venues were returned for each neighbourhood.

In [36]:
name_venues.groupby('Neighbourhood').count()

Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Berczy Park,65,65,65,65,65,65
"Brockton, Parkdale Village, Exhibition Place",85,85,85,85,85,85
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",100,100,100,100,100,100
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",71,71,71,71,71,71
Central Bay Street,77,77,77,77,77,77
Christie,11,11,11,11,11,11
Church and Wellesley,78,78,78,78,78,78
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
Davisville,26,26,26,26,26,26
Davisville North,8,8,8,8,8,8


Let's find out how many unique categories can be curated from all the returned venues

In [37]:
print('There are {} unique categories.'.format(len(name_venues['Venue Category'].unique())))

There are 228 unique categories.


Analyze each neighbourhood

In [38]:
# one hot encoding
name_onehot = pd.get_dummies(name_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighbourhood column back to dataframe
name_onehot['Neighbourhood'] = name_venues['Neighbourhood'] 

# move neighbourhood column to the first column
fixed_columns = [name_onehot.columns[-1]] + list(name_onehot.columns[:-1])
name_onehot = name_onehot[fixed_columns]

name_onehot.head()

Unnamed: 0,Neighbourhood,Accessories Store,Afghan Restaurant,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Train Station,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [39]:
# new dataframe size
name_onehot.shape

(1761, 229)

Next, let's group rows by neighbourhood, taking the mean of the frequency of occurrence of each category.

In [40]:
name_grouped = name_onehot.groupby('Neighbourhood').mean().reset_index()
name_grouped

Unnamed: 0,Neighbourhood,Accessories Store,Afghan Restaurant,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Train Station,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.015385,0.0,0.0,0.0,...,0.0,0.015385,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015385
1,"Brockton, Parkdale Village, Exhibition Place",0.011765,0.0,0.0,0.0,0.0,0.0,0.0,0.023529,0.011765,...,0.0,0.011765,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011765
2,"Business reply mail Processing Centre, South C...",0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.01,0.03,...,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014085,...,0.014085,0.0,0.014085,0.0,0.0,0.0,0.0,0.014085,0.0,0.014085
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.012987,0.012987,0.0,0.0,...,0.0,0.0,0.0,0.012987,0.012987,0.012987,0.0,0.0,0.0,0.0
5,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Church and Wellesley,0.0,0.012821,0.012821,0.0,0.0,0.0,0.0,0.012821,0.0,...,0.0,0.0,0.0,0.0,0.012821,0.0,0.0,0.0,0.0,0.012821
7,"Commerce Court, Victoria Hotel",0.0,0.0,0.04,0.0,0.0,0.01,0.0,0.0,0.0,...,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0
8,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [41]:
# the grouped new size
name_grouped.shape

(38, 229)
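
To make the meaning of these frequencies concrete, here is the same one-hot and group-mean pipeline on a tiny invented venue list (the dataframe venues and its contents are made up for illustration). Passing dtype=float to get_dummies is a choice made here so the indicator columns are numeric regardless of pandas version.

```python
import pandas as pd

# Invented venues: neighbourhood A has 4 venues, B has 1.
venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'A', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Bakery', 'Spa', 'Bakery'],
})

# One indicator column per category.
onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="", dtype=float)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# The mean of an indicator column is the share of venues in that category.
grouped = onehot.groupby('Neighbourhood').mean().reset_index()
print(grouped)
```

Each row of the grouped dataframe sums to 1 across the category columns, so neighbourhoods with very different venue counts become comparable.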

In [42]:
# print each neighbourhood along with its top 5 most common venues
num_top_venues = 5

for hood in name_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = name_grouped[name_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berczy Park----
            venue  freq
0     Coffee Shop  0.09
1      Restaurant  0.05
2     Cheese Shop  0.03
3  Farmers Market  0.03
4  Breakfast Spot  0.03


----Brockton, Parkdale Village, Exhibition Place----
         venue  freq
0  Coffee Shop  0.06
1         Café  0.06
2          Bar  0.06
3   Restaurant  0.05
4    Nightclub  0.04


----Business reply mail Processing Centre, South Central Letter Processing Plant Toronto----
                venue  freq
0         Coffee Shop  0.10
1               Hotel  0.05
2          Restaurant  0.04
3  Italian Restaurant  0.03
4                Café  0.03


----CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport----
                venue  freq
0         Coffee Shop  0.08
1  Italian Restaurant  0.07
2                Café  0.06
3                 Bar  0.04
4          Restaurant  0.03


----Central Bay Street----
            venue  freq
0     Coffee Shop  0.12
1  Clothing Store  0.05
2     

Let's put that into a pandas dataframe.
First, let's write a function to sort the venues in descending order.

In [43]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [54]:
import numpy as np
# print the top 10 venues for each neighbourhood
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = name_grouped['Neighbourhood']

for ind in np.arange(name_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(name_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head()


Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Restaurant,Cocktail Bar,Seafood Restaurant,Hotel,Cheese Shop,Breakfast Spot,Italian Restaurant,Beer Bar,Farmers Market
1,"Brockton, Parkdale Village, Exhibition Place",Coffee Shop,Bar,Café,Restaurant,Sandwich Place,Nightclub,Gift Shop,Lounge,Japanese Restaurant,Italian Restaurant
2,"Business reply mail Processing Centre, South C...",Coffee Shop,Hotel,Restaurant,Café,Italian Restaurant,Bar,Asian Restaurant,Thai Restaurant,Salon / Barbershop,Pub
3,"CN Tower, King and Spadina, Railway Lands, Har...",Coffee Shop,Italian Restaurant,Café,Bar,Electronics Store,Restaurant,French Restaurant,Gym / Fitness Center,Sandwich Place,Park
4,Central Bay Street,Coffee Shop,Clothing Store,Middle Eastern Restaurant,Bubble Tea Shop,Sandwich Place,Café,Bookstore,Hotel,Restaurant,Cosmetics Shop
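
The column labels above rely on a three-item suffix list plus an exception for everything else, which is fine for ten columns but would label a 21st column "21th". A more general labelling, together with a top-N selection via nlargest, might look like this. The helper ordinal and the sample row are hypothetical and only illustrate the idea.

```python
import pandas as pd

def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# Invented frequency row shaped like one row of name_grouped.
row = pd.Series({'Neighbourhood': 'Sample', 'Bakery': 0.1, 'Coffee Shop': 0.5,
                 'Spa': 0.0, 'Park': 0.4})

# Skip the name, then keep the three most frequent categories.
top = row.iloc[1:].astype(float).nlargest(3)
labelled = {'{} Most Common Venue'.format(ordinal(i + 1)): category
            for i, category in enumerate(top.index)}
print(labelled)
```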


Cluster Neighbourhoods:
Run k-means to cluster the neighbourhood into 5 clusters.

In [52]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

name_grouped_clustering = name_grouped.drop('Neighbourhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(name_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

Let's create a new dataframe that includes the cluster as well as the top 10 venues.

In [46]:
print(neighbourhoods_venues_sorted.shape)
print(name_data.shape)

(38, 11)
(39, 5)


In [47]:
The 'neighbourhoods_venues_sorted' dataframe contains fewer rows (38) than the 'name_merged' dataframe (39, initially the same as 'name_data')
because Foursquare does not return enough data for every neighbourhood. With the default 'left' join, the rows that have no match in
'neighbourhoods_venues_sorted' (including the 'Cluster Labels' column) are filled with NaN, which converts the 'Cluster Labels' column to float.
That causes an error during visualization, because the cluster labels are used as indices into the colour list and must be integers.
To fix this, I set the join() parameter 'how' to 'right' (the default is 'left'), so that the unmatched rows of 'name_merged'
are simply dropped during the join.

In [55]:
# add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

name_merged = name_data

# merge name_grouped with name_data to add latitude/longitude for each neighbourhood
name_merged = name_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood',how='right')

# check the last columns!

name_merged.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65514,-79.36265,0,Coffee Shop,Breakfast Spot,Yoga Studio,Theater,Pub,Distribution Center,Restaurant,Electronics Store,Event Space,Food Truck
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66449,-79.39302,0,Coffee Shop,Park,Café,Sandwich Place,Concert Hall,Museum,Chinese Restaurant,Salon / Barbershop,Clothing Store,Cocktail Bar
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.65736,-79.37818,0,Coffee Shop,Clothing Store,Café,Middle Eastern Restaurant,Cosmetics Shop,Japanese Restaurant,Diner,Ramen Restaurant,Bookstore,Furniture / Home Store
3,M5C,Downtown Toronto,St. James Town,43.65143,-79.37557,0,Coffee Shop,Café,Seafood Restaurant,Cocktail Bar,Restaurant,Gastropub,American Restaurant,Beer Bar,Moroccan Restaurant,Japanese Restaurant
4,M4E,East Toronto,The Beaches,43.67703,-79.29542,2,Health Food Store,Pub,Trail,Neighborhood,Yoga Studio,Dumpling Restaurant,Fast Food Restaurant,Farmers Market,Farm,Falafel Restaurant
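
The dtype effect described before the merge can be reproduced on a toy example. The dataframes left and labels below are invented for illustration; they only mimic the shape of name_data and neighbourhoods_venues_sorted.

```python
import pandas as pd

left = pd.DataFrame({'Neighbourhood': ['A', 'B', 'C'],
                     'Latitude': [43.1, 43.2, 43.3]})
labels = pd.DataFrame({'Neighbourhood': ['A', 'B'],   # no entry for 'C'
                       'Cluster Labels': [0, 2]})

# Default left join: 'C' is kept, its label becomes NaN, the column turns float.
joined_left = left.join(labels.set_index('Neighbourhood'), on='Neighbourhood')

# Right join: only neighbourhoods that have a label survive.
joined_right = left.join(labels.set_index('Neighbourhood'), on='Neighbourhood', how='right')

print(joined_left['Cluster Labels'].dtype)
print(joined_right['Cluster Labels'].dtype)
```

An alternative with the same effect would be to keep the left join, drop the rows where 'Cluster Labels' is NaN, and cast the column back with astype(int).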


Finally, let's visualize the final cluster results.

In [56]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(name_merged['Latitude'], name_merged['Longitude'], name_merged['Neighbourhood'], name_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Examine the Clusters.
Now, I can examine each cluster and determine the discriminating venue categories that distinguish each cluster.

### Cluster 1

In [57]:
name_merged.loc[name_merged['Cluster Labels'] == 0, name_merged.columns[[1] + list(range(5, name_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown Toronto,0,Coffee Shop,Breakfast Spot,Yoga Studio,Theater,Pub,Distribution Center,Restaurant,Electronics Store,Event Space,Food Truck
1,Downtown Toronto,0,Coffee Shop,Park,Café,Sandwich Place,Concert Hall,Museum,Chinese Restaurant,Salon / Barbershop,Clothing Store,Cocktail Bar
2,Downtown Toronto,0,Coffee Shop,Clothing Store,Café,Middle Eastern Restaurant,Cosmetics Shop,Japanese Restaurant,Diner,Ramen Restaurant,Bookstore,Furniture / Home Store
3,Downtown Toronto,0,Coffee Shop,Café,Seafood Restaurant,Cocktail Bar,Restaurant,Gastropub,American Restaurant,Beer Bar,Moroccan Restaurant,Japanese Restaurant
5,Downtown Toronto,0,Coffee Shop,Restaurant,Cocktail Bar,Seafood Restaurant,Hotel,Cheese Shop,Breakfast Spot,Italian Restaurant,Beer Bar,Farmers Market
6,Downtown Toronto,0,Coffee Shop,Clothing Store,Middle Eastern Restaurant,Bubble Tea Shop,Sandwich Place,Café,Bookstore,Hotel,Restaurant,Cosmetics Shop
7,Downtown Toronto,0,Café,Grocery Store,Coffee Shop,Playground,Candy Store,Athletics & Sports,Italian Restaurant,Baby Store,Falafel Restaurant,Elementary School
8,Downtown Toronto,0,Café,Hotel,Gym,Coffee Shop,Restaurant,Japanese Restaurant,American Restaurant,Gastropub,Salad Place,Asian Restaurant
9,West Toronto,0,Park,Grocery Store,Smoke Shop,Brazilian Restaurant,Bar,Bank,Bakery,Middle Eastern Restaurant,Athletics & Sports,Furniture / Home Store
10,Downtown Toronto,0,Coffee Shop,Hotel,Japanese Restaurant,Plaza,Restaurant,Aquarium,Park,Boat or Ferry,Deli / Bodega,IT Services


### Cluster 2

In [58]:
name_merged.loc[name_merged['Cluster Labels'] == 1, name_merged.columns[[1] + list(range(5, name_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
18,Central Toronto,1,Park,Yoga Studio,Dumpling Restaurant,Fish & Chips Shop,Fast Food Restaurant,Farmers Market,Farm,Falafel Restaurant,Event Space,Ethiopian Restaurant
21,Central Toronto,1,French Restaurant,Park,Yoga Studio,Dumpling Restaurant,Fish & Chips Shop,Fast Food Restaurant,Farmers Market,Farm,Falafel Restaurant,Event Space


### Cluster 3

In [59]:
name_merged.loc[name_merged['Cluster Labels'] == 2, name_merged.columns[[1] + list(range(5, name_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,East Toronto,2,Health Food Store,Pub,Trail,Neighborhood,Yoga Studio,Dumpling Restaurant,Fast Food Restaurant,Farmers Market,Farm,Falafel Restaurant


### Cluster 4

In [60]:
name_merged.loc[name_merged['Cluster Labels'] == 3, name_merged.columns[[1] + list(range(5, name_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
29,Central Toronto,3,Playground,Gym,Trail,Donut Shop,Fish & Chips Shop,Fast Food Restaurant,Farmers Market,Farm,Falafel Restaurant,Event Space


### Cluster 5

In [61]:
name_merged.loc[name_merged['Cluster Labels'] == 4, name_merged.columns[[1] + list(range(5, name_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
23,Central Toronto,4,Playground,Gym Pool,Park,Garden,Yoga Studio,Dumpling Restaurant,Fast Food Restaurant,Farmers Market,Farm,Falafel Restaurant
