# Coursera -- IBM Applied Data Science Capstone Project Week 3


This notebook is used mainly for the Week 3 project of the IBM Applied Data Science Capstone course.

#### * Please note: The interactive features of the notebook, such as folium maps, will not work in my repository on GitHub. To view my Jupyter notebook with JavaScript content rendered, please [click to view my notebook in nbviewer.](https://nbviewer.jupyter.org/github/J-C-Zh/Coursera-IBM-Data-Science-Professional-9-Applied-Data_Science_Capstone/blob/master/Coursera%20IBM%20Applied%20Data%20Science%20Capstone%20week%203.ipynb)

<h1>Table of contents</h1>

1. Scrape the Toronto postal code data frame from the Wikipedia page. The data frame should have three columns: 'Postal Code', 'Borough', and 'Neighborhood'.

* Transform the table into a pandas data frame.
* Ignore cells with a borough that is Not assigned.
* For rows that have different neighborhood names but have the same postal code, combine them into one row with the neighborhoods separated with a comma.
* If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

2. Get the latitude and longitude coordinates of each neighborhood, and add 'Latitude' and 'Longitude' columns to the Toronto postal code data frame. Visualize the data.

3. Explore and cluster the neighborhoods in Toronto. I will work only with boroughs that contain the word 'Toronto'.

## 1. Scrape the Toronto postal code dataframe (Part 1)

In [1]:
import pandas as pd
from pandas import DataFrame
import requests 
from bs4 import BeautifulSoup 

In [2]:
req = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M") 

soup = BeautifulSoup(req.content,'lxml') 
table_contents=[]
table=soup.find('table')
for row in table.findAll('td'):
    cell = {}
    if row.span.text=='Not assigned':
        pass
    else:
        cell['Postal Code'] = row.p.text[:3]
        cell['Borough'] = (row.span.text).split('(')[0]
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        table_contents.append(cell)
df=pd.DataFrame(table_contents)
df['Borough']=df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})
df

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto Business,Enclave of M4L
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


Check whether any postal code appears in more than one row.

In [3]:
len(df['Postal Code'].unique())

103

There are 103 unique postal codes and the df data frame has 103 rows, so no postal code is duplicated. Next, let's check whether any neighborhood is 'Not assigned'.

In [4]:
df.loc[(df['Neighborhood'] == 'Not assigned')].shape


(0, 3)

There is no 'Not assigned' neighborhood. The shape of the cleaned data frame is:

In [5]:
df.shape

(103, 3)

In [6]:
print('The dataframe has {} boroughs.'.format(
        len(df['Borough'].unique())
    )
)

The dataframe has 15 boroughs.
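The three cleaning rules listed in the table of contents can be sketched with pandas on a small toy table. This is only an illustration under stated assumptions: the rows in `raw` are made up, and the scraping code above handles the real page differently.

```python
import pandas as pd

# Toy rows standing in for the scraped table (made-up values, not the real data).
raw = pd.DataFrame({
    'Postal Code':  ['M1A', 'M3A', 'M5A', 'M5A', 'M7A'],
    'Borough':      ['Not assigned', 'North York', 'Downtown Toronto',
                     'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Regent Park',
                     'Harbourfront', 'Not assigned'],
})

# Ignore cells whose borough is 'Not assigned'.
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# A 'Not assigned' neighborhood takes the name of its borough.
unassigned = clean['Neighborhood'] == 'Not assigned'
clean.loc[unassigned, 'Neighborhood'] = clean.loc[unassigned, 'Borough']

# Combine neighborhoods that share a postal code into one comma-separated row.
clean = (clean.groupby(['Postal Code', 'Borough'], as_index=False)['Neighborhood']
              .agg(', '.join))
print(clean)
```

The borough fill is done before grouping so that a placeholder is never joined into a combined neighborhood string.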


## 2. Get coordinates for each neighbourhood (Part 2)

To utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood. There are two ways to get the geographical coordinates of the neighborhoods:

1. Using the Geocoder Python package
2. Using the CSV file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

I will use both methods to get the data.

### 1. Using the Geocoder Python package

In [7]:
!pip install geocoder
print('geocoder installed!')

geocoder installed!


Let's create an empty data frame called df_cor_geocoder. I will use this new data frame to store 'Postal Code', 'latitude', and 'longitude' data.

In [8]:
# define the dataframe columns
column_names = ['Postal Code', 'Latitude', 'Longitude'] 

# instantiate the dataframe
df_cor_geocoder = pd.DataFrame(columns=column_names)


I will loop over the postal codes and, for each one, retry in a while loop until the geocoder returns coordinates.

In [9]:
import geocoder # import geocoder

# DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
rows = []
for code in df['Postal Code']:
    # initialize the variable to None
    lat_lng_coords = None
    # loop until I get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(code))
        lat_lng_coords = g.latlng

    rows.append({'Postal Code': code,
                 'Latitude': lat_lng_coords[0],
                 'Longitude': lat_lng_coords[1]})

# build the data frame once, keeping the column order defined above
df_cor_geocoder = pd.DataFrame(rows, columns=column_names)
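The retry loop above never gives up, so a service that keeps returning nothing would hang the notebook. A minimal sketch of a bounded retry follows. The helper name `latlng_with_retries` and its parameters are hypothetical, and the lookup is passed in as a callable so the idea can be shown without a network call.

```python
import time

def latlng_with_retries(lookup, query, max_attempts=5, pause_seconds=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    `lookup` is any callable returning a [lat, lng] pair or None, for example
    lambda q: geocoder.arcgis(q).latlng in the notebook's setting.
    """
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(pause_seconds)  # brief pause before the next attempt
    return None  # give up instead of looping forever


# Stand-in lookup that fails twice before answering (made-up coordinates).
answers = iter([None, None, [43.0, -79.0]])
print(latlng_with_retries(lambda q: next(answers), 'M1B, Toronto, Ontario'))
```

A caller would then need to decide what to do with a `None` result, for instance skipping that postal code or falling back to the CSV coordinates.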


In [10]:
df_cor_geocoder = df_cor_geocoder.sort_values(by=['Postal Code']).reset_index(drop=True)
df_cor_geocoder.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.81139,-79.19662
1,M1C,43.78574,-79.15875
2,M1E,43.76575,-79.1747
3,M1G,43.76812,-79.21761
4,M1H,43.76944,-79.23892


In [11]:
df_cor_geocoder.shape

(103, 3)

### 2. Using the csv data
Read the CSV data from the provided link and store the 'Postal Code', 'Latitude', and 'Longitude' data in a new data frame called df_cor_csv.

In [12]:
df_cor_csv = pd.read_csv('https://cocl.us/Geospatial_data')
df_cor_csv = df_cor_csv.sort_values(by=['Postal Code'])
df_cor_csv.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [13]:
df_cor_csv.shape

(103, 3)

### 3. Compare two coordinate dataframes

df_cor_geocoder and df_cor_csv have the same shape, and both work for getting coordinate data. However, their first five rows show slightly different results. I can also confirm they are different by using the .equals function.

In [14]:
df_cor_geocoder.equals(df_cor_csv)

False

The above code returned False, which means the two data frames hold different data. Now I will check how big the difference is. First, let me create a new data frame called df_cor_diff. I will calculate the latitude and longitude differences for each postal code and store them in the new data frame.

In [15]:
df_cor_diff = pd.DataFrame(columns=['Postal Code','Latitude_diff','Longitude_diff'])
df_cor_diff.head()

Unnamed: 0,Postal Code,Latitude_diff,Longitude_diff


In [16]:
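# DataFrame.append was removed in pandas 2.0, so collect the rows in a list first.
# Both frames are sorted by postal code, so row i refers to the same code in each.
diff_rows = []
for i in range(0,df_cor_csv.shape[0]):
    diff_rows.append({'Postal Code': df_cor_csv['Postal Code'][i],
                      'Latitude_diff': round((df_cor_geocoder['Latitude'][i]-df_cor_csv['Latitude'][i]),3),
                      'Longitude_diff': round((df_cor_geocoder['Longitude'][i]-df_cor_csv['Longitude'][i]),3)})
df_cor_diff = pd.DataFrame(diff_rows, columns=['Postal Code','Latitude_diff','Longitude_diff'])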

df_cor_diff.head()

Unnamed: 0,Postal Code,Latitude_diff,Longitude_diff
0,M1B,0.005,-0.002
1,M1C,0.001,0.002
2,M1E,0.002,0.014
3,M1G,-0.003,-0.001
4,M1H,-0.004,0.001
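The loop above relies on both data frames being sorted the same way. An alternative sketch, assuming only that each frame has one row per postal code, is to merge on 'Postal Code' and subtract the columns directly. The two small frames below copy the 'M1B' and 'M1C' rows shown earlier, and the names `geo`, `csv`, `both`, and `diff` are illustrative.

```python
import pandas as pd

# Two rows from each source, deliberately in different orders.
geo = pd.DataFrame({'Postal Code': ['M1C', 'M1B'],
                    'Latitude': [43.78574, 43.81139],
                    'Longitude': [-79.15875, -79.19662]})
csv = pd.DataFrame({'Postal Code': ['M1B', 'M1C'],
                    'Latitude': [43.806686, 43.784535],
                    'Longitude': [-79.194353, -79.160497]})

# Align rows by postal code, then compute the differences column-wise.
both = geo.merge(csv, on='Postal Code', suffixes=('_geo', '_csv'))
diff = pd.DataFrame({
    'Postal Code': both['Postal Code'],
    'Latitude_diff': (both['Latitude_geo'] - both['Latitude_csv']).round(3),
    'Longitude_diff': (both['Longitude_geo'] - both['Longitude_csv']).round(3),
})
print(diff)
```

Because the merge matches rows by key, the result does not depend on the row order of either input.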


In [17]:
# take absolute values before the maximum so large negative differences count too
max_lat_diff = df_cor_diff['Latitude_diff'].abs().max()
max_lng_diff = df_cor_diff['Longitude_diff'].abs().max()
print('The biggest latitude difference is {}, and the biggest longitude difference is {}\n'
      .format(max_lat_diff, max_lng_diff))
print('The row with biggest latitude difference:')
print(df_cor_diff.loc[df_cor_diff['Latitude_diff'].abs()==max_lat_diff])
print('\n')

print('The row with biggest longitude difference:')
print(df_cor_diff.loc[df_cor_diff['Longitude_diff'].abs()==max_lng_diff])
print('\n')

The biggest latitude difference is 0.012, and the biggest longitude difference is 0.23

The row with biggest latitude difference:
   Postal Code  Latitude_diff  Longitude_diff
68         M5V          0.012          -0.004
86         M7R          0.012           0.230


The row with biggest longitude difference:
   Postal Code  Latitude_diff  Longitude_diff
86         M7R          0.012            0.23




The biggest longitude difference (0.23) is much larger than the biggest latitude difference (0.012). Postal code 'M7R' shows the biggest difference in both coordinates (tied with 'M5V' for latitude). Let's look more closely.

In [18]:
print('Data from geocoder method for postal code \'M7R\'')
print(df_cor_geocoder.loc[df_cor_geocoder['Postal Code']=='M7R'])
print('\n')
print('Data from csv file for postal code \'M7R\'')
print(df_cor_csv.loc[df_cor_csv['Postal Code']=='M7R'])

Data from geocoder method for postal code 'M7R'
   Postal Code  Latitude  Longitude
86         M7R  43.64869  -79.38544


Data from csv file for postal code 'M7R'
   Postal Code   Latitude  Longitude
86         M7R  43.636966 -79.615819


Using an online latitude/longitude calculator, I found that the distance between (43.64869, -79.38544) and (43.636966, -79.615819) is around 19 km. That is pretty large: for comparison, the geocoder data puts postal codes 'M1B' and 'M1C' only about 4 km apart. However, although the two coordinate datasets are not identical, the difference won't affect how I visualize and cluster the data. I will therefore merge the geocoder data frame with the earlier df data frame to create a new data frame that has all the information needed for the map.
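The distances quoted above can also be checked in code. The sketch below uses the haversine formula, assuming a spherical Earth with a mean radius of 6371 km, so the results are approximations. The function name `haversine_km` is illustrative.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a common approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# 'M7R' according to the geocoder versus the CSV file.
m7r_gap = haversine_km(43.64869, -79.38544, 43.636966, -79.615819)
# 'M1B' versus 'M1C', both from the geocoder data.
m1b_m1c = haversine_km(43.81139, -79.19662, 43.78574, -79.15875)
print(round(m7r_gap, 1), round(m1b_m1c, 1))
```

Both values should land close to the figures of around 19 km and about 4 km given in the text.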

In [19]:
# substitute df_cor_geocoder with df_cor_csv if you want to use csv data 
df_cor = df_cor_geocoder

### 4. Data Visualization

Merge df and df_cor into a new dataframe.

In [20]:
df_all = pd.merge(df,df_cor,on ='Postal Code')
df_all

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.75245,-79.32991
1,M4A,North York,Victoria Village,43.73057,-79.31306
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.72327,-79.45042
4,M7A,Queen's Park,Ontario Provincial Government,43.66253,-79.39188
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.65319,-79.51113
99,M4Y,Downtown Toronto,Church and Wellesley,43.66659,-79.38133
100,M7Y,East Toronto Business,Enclave of M4L,43.64869,-79.38544
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.63278,-79.48945


In [21]:
!pip install "folium==0.11.0"
import folium

print('Folium installed and imported!')

Folium installed and imported!


In [22]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

In [23]:
address = 'Toronto,Canada'

geolocator = Nominatim(user_agent="Toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [24]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(df_all['Latitude'], df_all['Longitude'], df_all['Borough'], df_all['Neighborhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

## 3. Explore and cluster the neighborhoods in Toronto (Part 3)
Let's simplify the above map and segment and cluster only boroughs that contain the word Toronto. So let's slice the original data frame and create a new data frame of the data that has the required borough's name.

In [25]:
name_data = df_all[df_all['Borough'].str.contains('Toronto')].reset_index(drop=True)
name_data.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264
1,M5B,Downtown Toronto,"Garden District, Ryerson",43.65739,-79.37804
2,M5C,Downtown Toronto,St. James Town,43.65215,-79.37587
3,M4E,East Toronto,The Beaches,43.67709,-79.29547
4,M5E,Downtown Toronto,Berczy Park,43.64536,-79.37306


In [26]:
print('There are {} postal codes for boroughs that contain the word \'Toronto\' in their names.'.format(name_data.shape[0]))

There are 39 postal codes for boroughs that contain the word 'Toronto' in their names.


In [27]:
# create map of boroughs that contain the word 'Toronto'
map_name = folium.Map(location=[latitude+0.02, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighbourhood in zip(name_data['Latitude'], name_data['Longitude'], name_data['Borough'], name_data['Neighborhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_name)  
    
map_name

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

In [29]:
# The code was removed by Watson Studio for sharing.
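The hidden cell presumably defines the `CLIENT_ID`, `CLIENT_SECRET`, and `VERSION` names used in the request URL below. One way to supply them without writing secrets into the notebook is to read them from environment variables, as in this sketch. The function name, the environment variable names, and the default version date are all assumptions, not part of the original notebook.

```python
import os

def load_foursquare_settings(env=None):
    """Read Foursquare settings from environment-style variables.

    The variable names FOURSQUARE_CLIENT_ID, FOURSQUARE_CLIENT_SECRET and
    FOURSQUARE_VERSION are hypothetical; pick whatever suits your setup.
    """
    env = os.environ if env is None else env
    return {
        'CLIENT_ID': env['FOURSQUARE_CLIENT_ID'],
        'CLIENT_SECRET': env['FOURSQUARE_CLIENT_SECRET'],
        # API version date in YYYYMMDD form; this default is only a placeholder.
        'VERSION': env.get('FOURSQUARE_VERSION', '20180605'),
    }

# Example with a plain dict standing in for os.environ (dummy values).
settings = load_foursquare_settings({'FOURSQUARE_CLIENT_ID': 'my-id',
                                     'FOURSQUARE_CLIENT_SECRET': 'my-secret'})
CLIENT_ID = settings['CLIENT_ID']
CLIENT_SECRET = settings['CLIENT_SECRET']
VERSION = settings['VERSION']
print(VERSION)
```

Calling `load_foursquare_settings()` with no argument reads from the real environment and raises a `KeyError` if a required variable is missing.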

Let's create a function to get up to 100 venues within a radius of 500 meters of each neighborhood in _name_data_.

In [30]:
import requests

radius = 500
LIMIT = 100

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
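The function above indexes straight into the JSON response, so a response with no groups or a venue with an empty category list would raise an error. A more defensive way to flatten one response is sketched below. The helper name `extract_venue_rows` is hypothetical, and `sample` is a hand-written payload that only mimics the nesting used above, not real API output.

```python
def extract_venue_rows(name, lat, lng, payload):
    """Flatten one explore-style response into rows, skipping incomplete items."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or []
        if not categories or 'lat' not in location or 'lng' not in location:
            continue  # skip venues lacking a category or coordinates
        rows.append((name, lat, lng, venue.get('name'),
                     location['lat'], location['lng'], categories[0]['name']))
    return rows

# Hand-written payload: one complete venue and one without categories.
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Cafe X', 'location': {'lat': 43.65, 'lng': -79.36},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'No Category', 'location': {'lat': 43.66, 'lng': -79.37},
               'categories': []}},
]}]}}

print(extract_venue_rows('Toy Neighborhood', 43.655, -79.362, sample))
```

Each returned tuple has the same seven fields, in the same order, as the columns assigned to `nearby_venues` above.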

Now write the code to run the above function on each neighborhood and create a new data frame called *name_venues*.

In [31]:
name_venues = getNearbyVenues(names=name_data['Neighborhood'],
                                   latitudes=name_data['Latitude'],
                                   longitudes=name_data['Longitude']
                                  )
name_venues.shape


Regent Park, Harbourfront
Garden District, Ryerson
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Richmond, Adelaide, King
Dufferin, Dovercourt Village
The Danforth  East
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
The Danforth West, Riverdale
Toronto Dominion Centre, Design Exchange
Brockton, Parkdale Village, Exhibition Place
India Bazaar, The Beaches West
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North & West
High Park, The Junction South
North Toronto West
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Davisville
University of Toronto, Harbord
Runnymede, Swansea
Moore Park, Summerhill East
Kensington Market, Chinatown, Grange Park
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Rosedale
Enclave of M5E
St. James Town, Cabbagetown
First Canadi

(1688, 7)

Let's check the size of the resulting data frame.

In [32]:
print(name_venues.shape)
name_venues.head()

(1688, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65512,-79.36264,Tandem Coffee,43.653559,-79.361809,Coffee Shop
1,"Regent Park, Harbourfront",43.65512,-79.36264,Roselle Desserts,43.653447,-79.362017,Bakery
2,"Regent Park, Harbourfront",43.65512,-79.36264,Figs Breakfast & Lunch,43.655675,-79.364503,Breakfast Spot
3,"Regent Park, Harbourfront",43.65512,-79.36264,The Yoga Lounge,43.655515,-79.364955,Yoga Studio
4,"Regent Park, Harbourfront",43.65512,-79.36264,Body Blitz Spa East,43.654735,-79.359874,Spa


Let's check how many venues were returned for each neighborhood.

In [33]:
name_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Berczy Park,64,64,64,64,64,64
"Brockton, Parkdale Village, Exhibition Place",83,83,83,83,83,83
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",79,79,79,79,79,79
Central Bay Street,59,59,59,59,59,59
Christie,10,10,10,10,10,10
Church and Wellesley,80,80,80,80,80,80
"Commerce Court, Victoria Hotel",100,100,100,100,100,100
Davisville,27,27,27,27,27,27
Davisville North,7,7,7,7,7,7
"Dufferin, Dovercourt Village",16,16,16,16,16,16


Let's find out how many unique categories can be curated from all the returned venues

In [34]:
print('There are {} unique categories.'.format(len(name_venues['Venue Category'].unique())))

There are 223 unique categories.


Analyze each neighborhood

In [35]:
# one hot encoding
name_onehot = pd.get_dummies(name_venues[['Venue Category']], prefix="", prefix_sep="")

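# add neighbourhood column back to dataframe
# note: the output below (223 columns, 'Yoga Studio' first) suggests Foursquare
# returned a venue category literally named 'Neighborhood', so this assignment
# overwrites that one-hot column instead of appending a new one
name_onehot['Neighborhood'] = name_venues['Neighborhood'] 

# move the last column to the first position (this is the neighbourhood column
# only when the assignment above appended it as a new column)
fixed_columns = [name_onehot.columns[-1]] + list(name_onehot.columns[:-1])
name_onehot = name_onehot[fixed_columns]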

name_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Adult Boutique,Afghan Restaurant,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [36]:
# new dataframe size
name_onehot.shape

(1688, 223)

Next, let's group the rows by neighborhood and take the mean of each one-hot column, which gives the frequency of occurrence of each category per neighborhood.

In [37]:
name_grouped = name_onehot.groupby('Neighborhood').mean().reset_index()
name_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Adult Boutique,Afghan Restaurant,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint
0,Berczy Park,0.015625,0.0,0.0,0.0,0.0,0.015625,0.0,0.015625,0.0,...,0.0,0.0,0.0,0.015625,0.0,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.012048,0.012048,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.012048,0.0,0.0,0.0,0.0,0.0,0.0
2,"CN Tower, King and Spadina, Railway Lands, Har...",0.012658,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.012658,0.0,0.012658,0.0,0.0,0.0,0.0,0.0
3,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016949,0.016949,...,0.0,0.0,0.0,0.0,0.0,0.016949,0.016949,0.016949,0.0,0.0
4,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Church and Wellesley,0.0125,0.0,0.0125,0.0125,0.0125,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,"Commerce Court, Victoria Hotel",0.01,0.0,0.0,0.0,0.03,0.0,0.0,0.01,0.0,...,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0
7,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,"Dufferin, Dovercourt Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [38]:
# the grouped new size
name_grouped.shape

(39, 223)
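To make the one-hot-and-average step concrete, here is a toy sketch with two made-up neighborhoods. The data and the names `venues`, `onehot`, and `grouped` are illustrative. Note that `DataFrame.insert` raises a `ValueError` if a column with the same name already exists, so a venue category that happened to be called 'Neighborhood' would be reported rather than silently overwritten.

```python
import pandas as pd

# Made-up venues: neighborhood 'A' has four, neighborhood 'B' has two.
venues = pd.DataFrame({
    'Neighborhood':   ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Café', 'Café', 'Park', 'Bar', 'Park', 'Park'],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)

# Put the neighborhood in front; raises if that column name is already taken.
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of a 0/1 column is the share of venues in that category.
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```

In this toy case half of the venues in 'A' are cafés, so its 'Café' frequency is 0.5, and every venue in 'B' is a park.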

In [39]:
# Let's print each neighbourhood along with the top 5 most common venues
num_top_venues = 5

for hood in name_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = name_grouped[name_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berczy Park----
                venue  freq
0         Coffee Shop  0.09
1        Cocktail Bar  0.06
2  Seafood Restaurant  0.05
3              Bakery  0.05
4      Breakfast Spot  0.03


----Brockton, Parkdale Village, Exhibition Place----
         venue  freq
0         Café  0.06
1          Bar  0.06
2  Coffee Shop  0.05
3   Restaurant  0.05
4    Gift Shop  0.04


----CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport----
                  venue  freq
0    Italian Restaurant  0.08
1           Coffee Shop  0.06
2                  Café  0.05
3        Sandwich Place  0.04
4  Gym / Fitness Center  0.04


----Central Bay Street----
            venue  freq
0     Coffee Shop  0.12
1  Clothing Store  0.07
2     Pizza Place  0.05
3  Sandwich Place  0.03
4      Restaurant  0.03


----Christie----
           venue  freq
0           Café   0.3
1  Grocery Store   0.2
2     Playground   0.1
3    Coffee Shop   0.1
4    Candy Store   0.1


--

Let's put that into a pandas data frame.
First, let's write a function to sort the venues in descending order.

In [40]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [41]:
import numpy as np
# print the top 10 venues for each neighbourhood
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = name_grouped['Neighborhood']

for ind in np.arange(name_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(name_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()


Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Cocktail Bar,Seafood Restaurant,Bakery,Restaurant,Cheese Shop,Breakfast Spot,Pharmacy,Farmers Market,Beer Bar
1,"Brockton, Parkdale Village, Exhibition Place",Bar,Café,Coffee Shop,Restaurant,Gift Shop,Bakery,Sandwich Place,Supermarket,Japanese Restaurant,Furniture / Home Store
2,"CN Tower, King and Spadina, Railway Lands, Har...",Italian Restaurant,Coffee Shop,Café,Park,Sandwich Place,Bar,Gym / Fitness Center,French Restaurant,Grocery Store,Restaurant
3,Central Bay Street,Coffee Shop,Clothing Store,Pizza Place,Plaza,Cosmetics Shop,Middle Eastern Restaurant,Bubble Tea Shop,Sandwich Place,Restaurant,Café
4,Christie,Café,Grocery Store,Candy Store,Baby Store,Playground,Italian Restaurant,Coffee Shop,Elementary School,Eastern European Restaurant,Electronics Store


Cluster Neighborhoods:
Run k-means to cluster the neighborhood into 5 clusters.

In [42]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

# the positional axis argument was removed in pandas 2.0, so name the columns explicitly
name_grouped_clustering = name_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(name_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

Let's create a new data frame that includes the cluster as well as the top 10 venues.

In [43]:
print(neighborhoods_venues_sorted.shape)
print(name_data.shape)

(39, 11)
(39, 5)


In this run both data frames have 39 rows, as shown above. In general, though, 'neighborhoods_venues_sorted' can contain fewer rows than 'name_data', because Foursquare may return no venues for some neighborhoods. With the default left join, the rows missing from 'neighborhoods_venues_sorted' (including the 'Cluster Labels' column) would be filled with NaN, which converts 'Cluster Labels' to float. That causes an error when I visualize the data, because the cluster labels are used as indices into the color list and must be integers. To avoid this, I set the 'how' parameter of join() to 'right' (the default is 'left'), so the extra rows of 'name_data' are simply dropped in the merge.
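The effect of the join type on the label column can be seen on a toy example. In this sketch `places` stands in for the neighborhood table and `labels` for the clustered table, which is missing neighborhood 'C'. Both frames and their values are made up.

```python
import pandas as pd

places = pd.DataFrame({'Neighborhood': ['A', 'B', 'C']})
labels = pd.DataFrame({'Neighborhood': ['A', 'B'], 'Cluster Labels': [0, 1]})

# Default left join: 'C' has no label, so the column gains a NaN.
left = places.join(labels.set_index('Neighborhood'), on='Neighborhood')

# Right join: only neighborhoods present in `labels` are kept.
right = places.join(labels.set_index('Neighborhood'), on='Neighborhood', how='right')

print(left['Cluster Labels'].dtype)   # a float dtype, because of the NaN
print(right['Cluster Labels'].dtype)  # still an integer dtype
```

An alternative with the same effect would be to keep the left join, drop the rows with missing labels, and cast the column back to integers.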

In [44]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

name_merged = name_data

# merge neighborhoods_venues_sorted with name_data to add latitude/longitude for each neighbourhood
name_merged = name_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood',how='right')

# check the last columns!

name_merged.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264,1,Coffee Shop,Breakfast Spot,Thai Restaurant,Playground,Distribution Center,Pub,Restaurant,Electronics Store,Event Space,Spa
1,M5B,Downtown Toronto,"Garden District, Ryerson",43.65739,-79.37804,1,Coffee Shop,Clothing Store,Café,Cosmetics Shop,Hotel,Italian Restaurant,Middle Eastern Restaurant,Japanese Restaurant,Fast Food Restaurant,Theater
2,M5C,Downtown Toronto,St. James Town,43.65215,-79.37587,1,Coffee Shop,Cosmetics Shop,Café,Italian Restaurant,Clothing Store,Cocktail Bar,Gastropub,Hotel,Moroccan Restaurant,American Restaurant
3,M4E,East Toronto,The Beaches,43.67709,-79.29547,1,Coffee Shop,Park,Health Food Store,Pub,Trail,Asian Restaurant,Wings Joint,Dumpling Restaurant,Eastern European Restaurant,Electronics Store
4,M5E,Downtown Toronto,Berczy Park,43.64536,-79.37306,1,Coffee Shop,Cocktail Bar,Seafood Restaurant,Bakery,Restaurant,Cheese Shop,Breakfast Spot,Pharmacy,Farmers Market,Beer Bar


Finally, let's visualize the final cluster results.

In [47]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(name_merged['Latitude'], name_merged['Longitude'], name_merged['Neighborhood'], name_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
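The map code above takes its colors from matplotlib's rainbow colormap and looks them up with `cluster-1`. As an illustration of why the labels must be integers, the sketch below builds a comparable palette with the standard library's `colorsys` module and indexes it directly by label. The function name `cluster_palette` and the saturation and brightness values are arbitrary choices, not the notebook's method.

```python
import colorsys

def cluster_palette(n_clusters):
    """Return n_clusters evenly spaced hex colors, one per cluster label."""
    palette = []
    for k in range(n_clusters):
        # Spread the hues evenly around the color wheel.
        r, g, b = colorsys.hsv_to_rgb(k / n_clusters, 0.8, 0.9)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            int(round(r * 255)), int(round(g * 255)), int(round(b * 255))))
    return palette

palette = cluster_palette(5)
labels = [1, 0, 4]  # made-up cluster labels
colors_for_markers = [palette[int(label)] for label in labels]
print(colors_for_markers)
```

A float label such as 1.0 cannot index a list directly, which is the error described before the merge step; the `int(...)` cast here guards against that case.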

Examine the Clusters.
Now, I can examine each cluster and determine the discriminating venue categories that distinguish each cluster.

### Cluster 1

In [48]:
name_merged.loc[name_merged['Cluster Labels'] == 0, name_merged.columns[[1] + list(range(5, name_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
9,East York/East Toronto,0,Park,Intersection,Music Venue,Dog Run,Farm,Falafel Restaurant,Event Space,Ethiopian Restaurant,Escape Room,Elementary School
21,Central Toronto,0,Park,Wings Joint,Dog Run,Farmers Market,Farm,Falafel Restaurant,Event Space,Ethiopian Restaurant,Escape Room,Elementary School


### Cluster 2

In [49]:
name_merged.loc[name_merged['Cluster Labels'] == 1, name_merged.columns[[1] + list(range(5, name_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown Toronto,1,Coffee Shop,Breakfast Spot,Thai Restaurant,Playground,Distribution Center,Pub,Restaurant,Electronics Store,Event Space,Spa
1,Downtown Toronto,1,Coffee Shop,Clothing Store,Café,Cosmetics Shop,Hotel,Italian Restaurant,Middle Eastern Restaurant,Japanese Restaurant,Fast Food Restaurant,Theater
2,Downtown Toronto,1,Coffee Shop,Cosmetics Shop,Café,Italian Restaurant,Clothing Store,Cocktail Bar,Gastropub,Hotel,Moroccan Restaurant,American Restaurant
3,East Toronto,1,Coffee Shop,Park,Health Food Store,Pub,Trail,Asian Restaurant,Wings Joint,Dumpling Restaurant,Eastern European Restaurant,Electronics Store
4,Downtown Toronto,1,Coffee Shop,Cocktail Bar,Seafood Restaurant,Bakery,Restaurant,Cheese Shop,Breakfast Spot,Pharmacy,Farmers Market,Beer Bar
5,Downtown Toronto,1,Coffee Shop,Clothing Store,Pizza Place,Plaza,Cosmetics Shop,Middle Eastern Restaurant,Bubble Tea Shop,Sandwich Place,Restaurant,Café
6,Downtown Toronto,1,Café,Grocery Store,Candy Store,Baby Store,Playground,Italian Restaurant,Coffee Shop,Elementary School,Eastern European Restaurant,Electronics Store
7,Downtown Toronto,1,Coffee Shop,Hotel,Café,Restaurant,Gym,Salad Place,Asian Restaurant,American Restaurant,Steakhouse,Japanese Restaurant
8,West Toronto,1,Park,Furniture / Home Store,Bakery,Pharmacy,Pizza Place,Middle Eastern Restaurant,Café,Smoke Shop,Liquor Store,Bus Line
10,Downtown Toronto,1,Coffee Shop,Park,Hotel,Japanese Restaurant,Aquarium,Boat or Ferry,Plaza,Theater,Deli / Bodega,Electronics Store


### Cluster 3

In [50]:
name_merged.loc[name_merged['Cluster Labels'] == 2, name_merged.columns[[1] + list(range(5, name_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
23,Central Toronto,2,Gym Pool,Playground,Park,Dive Bar,Farm,Falafel Restaurant,Event Space,Ethiopian Restaurant,Escape Room,Elementary School
33,Downtown Toronto,2,Gym / Fitness Center,Park,Playground,Tennis Court,Creperie,Dog Run,Farm,Falafel Restaurant,Event Space,Ethiopian Restaurant


### Cluster 4

In [51]:
name_merged.loc[name_merged['Cluster Labels'] == 3, name_merged.columns[[1] + list(range(5, name_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
19,Central Toronto,3,Fast Food Restaurant,Home Service,Wings Joint,Donut Shop,Farmers Market,Farm,Falafel Restaurant,Event Space,Ethiopian Restaurant,Escape Room


### Cluster 5

In [52]:
name_merged.loc[name_merged['Cluster Labels'] == 4, name_merged.columns[[1] + list(range(5, name_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
18,Central Toronto,4,Bus Line,Swim School,Wings Joint,Fish & Chips Shop,Farmers Market,Farm,Falafel Restaurant,Event Space,Ethiopian Restaurant,Escape Room
