# Data Science - Capstone Project

# Introduction:

<p Style='font-size:140%;line-height: 30px'> This notebook was created for the applied data science capstone project, to explore and cluster the neighborhoods in Toronto.</p>

<hr style="height:2px;border-width:0;color:white;background-color:grey">

## PART I - First Exercise

<ol Style='font-size:120%;line-height: 25px'>
    <li> Scrape the postal code table from the following Wikipedia page: </li>
    <a  href = https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M> https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M </a>
    <li> Transform the data obtained from the Wikipedia page into a pandas dataframe.</li> </ol>

### Scrape the postal code table from the Wikipedia page:
<hr style="height:1px;border-width:0;color:white;background-color:brown">
<li Style='font-size:110%;color:brown'> Let us start by importing a few essential libraries. </li>

In [10]:
## Import required libraries
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import numpy as np # library to handle data in a vectorized manner
print("Libraries Imported!")

Libraries Imported!


<li Style='font-size:110%;color:brown'> Let us install the beautifulsoup4 package; we will use it to scrape the Wikipedia web page. </li>

In [11]:
!pip install beautifulsoup4 # comment out this line if the package is already installed

import requests # for reading a webpage through its URL
from bs4 import BeautifulSoup # package required for webpage scraping

print("\nBeautifulSoup Imported!")


BeautifulSoup Imported!


<li Style='font-size:110%;color:brown'> Save the HTML page as a BeautifulSoup object called 'wikidata'. </li>

In [12]:
## Download the HTML page with wget
!wget -q -O 'Toronto_postal_codes.html' https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

## Parse the saved page with BeautifulSoup, naming the parser explicitly
with open("Toronto_postal_codes.html", encoding="utf-8") as file:
    wikidata = BeautifulSoup(file, "html.parser")

<li Style='font-size:110%;color:brown'> Explore the data from 'wikidata' object. </li>

In [13]:
## To explore the downloaded HTML page, uncomment the print command below
#print(wikidata.contents)
print(wikidata.find_all('title'))
print(wikidata.table.attrs.values())
print(wikidata.table.attrs.keys(), "\n")
print(str(wikidata.find_all('table')[0].contents)[0:150])

[<title>List of postal codes of Canada: M - Wikipedia</title>]
dict_values([['wikitable', 'sortable']])
dict_keys(['class']) 

['\n', <tbody><tr>
<th>Postal Code
</th>
<th>Borough
</th>
<th>Neighborhood
</th></tr>
<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td


<li Style='font-size:110%;color:brown'> Check which table in the 'wikidata' object contains the required data. </li>

In [14]:
## To find the desired table, loop over all tables and list the index, class and column headers of each
for i, data in enumerate(wikidata.find_all('table')):
    print('\n', data.name, i)
    #print(list(data.attrs.keys()))
    print('Header of table index no:', i, 'and table class =', list(data.attrs.values()))
    headers = []
    for col_head in data.find_all('th'):
        headers.append(str(col_head.text).strip('\n'))
    print('columns:', headers)

<li Style='font-size:110%;color:brown'> 
    Now that we know which table holds Postal Code, Borough and Neighborhood, let us extract its data.</li>

In [15]:
## Extract the requisite table (i.e. the html_table tag) from wikidata
html_table = wikidata.find('table', attrs={'class':'wikitable sortable'})
print ('\n1. html_table is', type(html_table),':\n')

## Read the tag with pandas; read_html returns a list of dataframes (pd_list)
pd_list = pd.read_html(str(html_table))
print ('2. pd_list shown below is', type(pd_list),':\n\n',pd_list[0][:3],'\n')

## Take the first table from the list as our pandas dataframe
Toronto_org_df = pd.DataFrame(pd_list[0])
print ('3. Toronto_org_df shown below is',type(Toronto_org_df),':\n\n'
       ,Toronto_org_df.head(3),'\n')


1. html_table is <class 'bs4.element.Tag'> :

2. pd_list shown below is <class 'list'> :

   Postal Code       Borough  Neighborhood
0         M1A  Not assigned  Not assigned
1         M2A  Not assigned  Not assigned
2         M3A    North York     Parkwoods 

3. Toronto_org_df shown below is <class 'pandas.core.frame.DataFrame'> :

   Postal Code       Borough  Neighborhood
0         M1A  Not assigned  Not assigned
1         M2A  Not assigned  Not assigned
2         M3A    North York     Parkwoods 
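
To make the extraction step less opaque, here is a minimal, dependency-free sketch of what pulling rows out of an HTML table involves. It uses the standard library's `html.parser` rather than BeautifulSoup or `pd.read_html`; the `TableRows` class and the `sample_html` snippet are illustrative assumptions, not part of the notebook.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Hypothetical helper: collect the text of every <th>/<td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text chunks of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# A tiny stand-in for the Wikipedia table (illustrative only).
sample_html = """
<table class="wikitable sortable"><tbody>
<tr><th>Postal Code
</th><th>Borough
</th><th>Neighborhood
</th></tr>
<tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</tbody></table>
"""

parser = TableRows()
parser.feed(sample_html)
header, *records = parser.rows
print(header)
print(records)
```

In practice `pd.read_html` does this work for us and also infers column types, so the sketch is only meant to show what happens under the hood.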



### Transform the Data: Data Preprocessing and Cleansing
<hr style="height:1px;border-width:0;color:white;background-color:brown">
<b Style='font-size:110%;line-height: 25px'>Instruction:</b>
<ul Style='font-size:110%;line-height:25px'>
    <li Style='left-padding:1'> 
        Only process the cells that have an assigned borough, i.e. ignore cells without one.</li>
    <li> If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.</li>
    <li> Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.</li>
    <li> In the last cell of your notebook, use the shape method to print the number of rows of your dataframe.</li> 
</ul>

<li Style='font-size:110%;color:brown'> 
    Keep the rows that have an assigned Borough, and fill in the Neighborhood where it is missing.</li>

In [16]:
## View original data
print('\n','Original data:\n\n', Toronto_org_df.head(3)
      , '\n\nOriginal data has #',Toronto_org_df.shape[0],'rows.')

## Create a copy of the dataframe to work on. Note: keep the original for reference
Toronto_data = Toronto_org_df.copy()

## View Data type and missing value etc.
#Toronto_data.info(); Toronto_data.describe()

## View the rows where borough is not assigned
#print(Toronto_data[Toronto_data['Borough']=='Not assigned']) # uncomment to view
print('\n#',len(Toronto_data[Toronto_data['Borough']=='Not assigned'])
      ,'row were ignored as borough is not assigned:\n')

## Retain only relevant rows where Borough is assigned
Toronto_data = Toronto_data[Toronto_data['Borough']!='Not assigned'].reset_index(drop=True)  
print(' New dataframe:\n\n', Toronto_data.head(), 
      '\n\nNew dataframe has #',Toronto_data.shape[0],'rows.')


 Original data:

   Postal Code       Borough  Neighborhood
0         M1A  Not assigned  Not assigned
1         M2A  Not assigned  Not assigned
2         M3A    North York     Parkwoods 

Original data has # 180 rows.

# 77 row were ignored as borough is not assigned:

 New dataframe:

   Postal Code           Borough                                 Neighborhood
0         M3A        North York                                    Parkwoods
1         M4A        North York                             Victoria Village
2         M5A  Downtown Toronto                    Regent Park, Harbourfront
3         M6A        North York             Lawrence Manor, Lawrence Heights
4         M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government 

New dataframe has # 103 rows.


Double-click __here__ for notes on different methods and coding practice.

<!--
# View original data
print('\nOriginal Data:\n', Toronto_org_df.head(), '\n\nOriginal Data has #',Toronto_org_df.shape[0],'rows.')

# Create a copy of the dataframe to work on. Note: keep the original for reference
Toronto_data = Toronto_org_df.copy()

# View Data type and missing value etc.
#Toronto_data.info(); Toronto_data.describe()
# Correct data format - if required 
#df[['Postal Code','Borough']] = df[['Postal Code','Borough']].astype("str") # "int","float","str"

# View the rows where borough is not assigned
#print(Toronto_data[Toronto_data['Borough']=='Not assigned']) # uncomment to view
print('\n#', len(Toronto_data[Toronto_data['Borough']=='Not assigned']),'row were ignored as borough is not assigned:')

# Query index & drop the rows where borough is not assigned
Toronto_data.drop(Toronto_data[Toronto_data['Borough']=='Not assigned'].index,axis=0, inplace=True)
Toronto_data.reset_index(drop=True, inplace=True) # Reset index, because we dropped a few rows

# OR retain only relevant rows where Borough is assigned
#Toronto_data = Toronto_data[Toronto_data['Borough']!='Not assigned'].reset_index(drop=True) # easy way  

print('\nNew dataframe:\n', Toronto_data.head(), '\n\nNew dataframe has #',Toronto_data.shape[0],'rows.') -->

<li Style='font-size:110%;color:brown'> Assign the Borough as the Neighborhood wherever the Neighborhood is missing.</li>

In [17]:
## Check whether any Neighborhood is missing
# print(Toronto_data['Neighborhood']) # Uncomment to view; no missing value was found

## Check for null values in the Neighborhood column
#Toronto_data['Neighborhood'].isnull() # Uncomment to view; no null value was found
print('\nIs there any null values found in Neighborhood column ?\nNote: True count represents nulls.',
      '\n\n',Toronto_data['Neighborhood'].isnull().value_counts())

## Rename the columns so that 'Postal Code' no longer contains a space
Toronto_data.columns = ['PostalCode', 'Borough', 'Neighborhood']


Is there any null values found in Neighborhood column ?
Note: True count represents nulls. 

 False    103
Name: Neighborhood, dtype: int64
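
Note that `isnull()` only detects true missing values (NaN), whereas the Wikipedia table marks gaps with the literal text 'Not assigned'. The sketch below applies both assignment rules to such placeholder strings on a small made-up dataframe. The `toy` data is hypothetical, and this is a suggested check rather than something the notebook ran.

```python
import pandas as pd

# Toy rows in the same shape as the scraped table (values are illustrative).
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Rule 1: keep only rows with an assigned borough.
toy = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# Rule 2: where the neighborhood is the placeholder string, copy the borough.
missing = toy["Neighborhood"] == "Not assigned"
toy.loc[missing, "Neighborhood"] = toy.loc[missing, "Borough"]

print(toy)
print("Placeholders left:", int((toy["Neighborhood"] == "Not assigned").sum()))
```

Running the same comparison against `Toronto_data['Neighborhood']` would confirm whether any placeholder strings remain in the real data.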


In [18]:
## View Final data frame
print('\nToronto dataframe:')
Toronto_data.head(12)


Toronto dataframe:


| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |


<li Style='font-size:110%;color:brown'> Use the shape method to print the number of rows of your dataframe.</li>

In [19]:
print('\nNew dataframe has #',Toronto_data.shape[0],'rows.')


New dataframe has # 103 rows.


<h6 style='text-align:center'> End of PART I </h6>
<hr style="height:2px;border-width:0;color:white;background-color:grey">

## PART II - Second Exercise

<p Style='font-size:130%;line-height: 25px'> Get the coordinates of each neighborhood and include them in the dataframe, in order to fetch the Foursquare location data.</p>
<ol Style='font-size:120%;line-height: 25px'>
    <li> Use the Geocoder Python package for geocoding:
    <a href='https://geocoder.readthedocs.io/index.html'> https://geocoder.readthedocs.io/index.html </a> </li> 
    <em Style='font-size:90%'> Note: Given that this package can be very unreliable, in case you are not able to get the coordinates, here is a link to a csv file that has the geographical coordinates of each postal code:</em> 
    <a href='http://cocl.us/Geospatial_data'> http://cocl.us/Geospatial_data </a>
    <li> Create the dataframe, using the Geocoder package or the csv file:</li>
</ol>
<hr style="height:1px;border-width:0;color:white;background-color:brown">

<li Style='font-size:110%;color:brown'> Let us start by installing the Geocoder package. </li>

In [20]:
## Install Geocoder
!pip install geocoder
import geocoder



<li Style='font-size:110%;color:brown'>Test the Geocoder by geocoding one of the addresses listed below. </li>

In [21]:
## Candidate addresses: 'Mountain View, CA', '102 North End Ave, New York, NY', 'M5G, Downtown Toronto, Central Bay Street'
g = geocoder.google('Mountain View, CA') 
print(g.latlng)

None


<li Style='font-size:110%;color:brown'> Let us test the Google Maps Geocoding API, as Geocoder returns None for each address I tried to geocode. </li>

In [22]:
import requests
url = 'https://maps.googleapis.com/maps/api/geocode/json'
params = {'sensor': 'false', 'address': 'Mountain View, CA'}
r = requests.get(url, params=params)
results = r.json()
results

{'error_message': 'You must use an API key to authenticate each request to Google Maps Platform APIs. For additional information, please refer to http://g.co/dev/maps-no-account',
 'results': [],
 'status': 'REQUEST_DENIED'}

<li Style='font-size:110%;color:brown'> Let us get the coordinates from the CSV file, since Geocoder does not seem to be working.</li> 
<em Style='left-padding:1;font-size:110%'> Note: alternatives include installing geocoder from Conda or using the Nominatim module of the Geopy package. </em>

In [23]:
## Let's download the data and save it as a CSV file called Toronto_geo.csv
!wget -q -O 'Toronto_geo.csv' http://cocl.us/Geospatial_data

## Now that the data is downloaded, let's read it into a pandas dataframe.
Toronto_geo = pd.read_csv("Toronto_geo.csv")
#Toronto_geo.head()

In [24]:
## Create new DataFrame & bring Lat, Long to our DataFrame

## Print the number of records to check that both datasets have the same number of rows
print('\nMake sure no. of records matches across Dataframes:'
      ,'\n  # Records in original DataFrame:',Toronto_data.shape[0]
      ,'\n  # Records in cordinates DataFrame:',Toronto_geo.shape[0])

## Joining the DataFrames
Toronto_data_final = Toronto_data.join(Toronto_geo.set_index('Postal Code')
                                       , on='PostalCode', lsuffix='_l', rsuffix='_r')

## Print the final DataFrame's information to check whether it has any missing or null values
print('  # Records in final DataFrame:',Toronto_data_final.shape[0],'\n\n Details:')
#print(Toronto_data_final.info(),'\n')

## View final DataFrame
Toronto_data_final.head(12)
#Toronto_data_final.to_csv('net.csv', index=False) ## Saving the Final Dataframe as CSV


Make sure no. of records matches across Dataframes: 
  # Records in original DataFrame: 103 
  # Records in cordinates DataFrame: 103
  # Records in final DataFrame: 103 

 Details:


| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village | 43.667856 | -79.532242 |
| 6 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 7 | M3B | North York | Don Mills | 43.745906 | -79.352188 |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens | 43.706397 | -79.309937 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
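
Matching row counts do not by themselves prove that every postal code found its coordinates, because a left join keeps all left-hand rows whether or not they match. The following sketch shows one way to verify the match explicitly with `merge` and its `validate` and `indicator` options. The two small dataframes are made up for illustration (the code `M9Z` is hypothetical), and this is an alternative to the `join` call used above, not a replacement the notebook requires.

```python
import pandas as pd

# Illustrative stand-ins for Toronto_data and Toronto_geo.
left = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})
right = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],  # M9Z is a made-up code with no partner
    "Latitude": [43.753259, 43.725882, 43.0],
    "Longitude": [-79.329656, -79.315572, -79.0],
})

merged = left.merge(
    right,
    how="left",
    left_on="PostalCode",
    right_on="Postal Code",
    validate="one_to_one",  # raise if a postal code is duplicated on either side
    indicator=True,         # add a _merge column recording where each row matched
).drop(columns="Postal Code")

# Postal codes that received no coordinates.
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(merged)
print("Unmatched postal codes:", unmatched)
```

An empty `unmatched` list on the real data would confirm that all 103 postal codes were given a latitude and longitude.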


<h6 style='text-align:center'> End of PART II </h6>
<hr style="height:2px;border-width:0;color:white;background-color:grey">

## PART III - Third Exercise
<p Style='font-size:130%;line-height: 25px'> Explore and cluster the neighborhoods in Toronto, replicating the analysis we did for New York.</p> 
<ol Style='font-size:120%;line-height: 25px'>
    <li> Explore the Toronto dataset and select a borough to cluster.</li>
    <li> Explore neighborhoods in selected borough.</li>
    <li> Analyze each neighborhood.</li>
    <li> Cluster neighborhoods.</li>
    <li> Examine clusters.</li>
</ol>
<b Style='font-size:130%;line-height: 25px'>Instruction:</b>
    <ul Style='font-size:110%;line-height:25px'>        
        <li Style='left-padding:1'> You can select the data to work with, such as the boroughs that contain the word Toronto.</li>
        <li> Add Markdown cells to explain what you decided to do and to report any observations you make.</li>
        <li> Generate maps to visualize your neighborhoods and how they cluster together.</li>
    </ul>

### 1. Explore the Toronto dataset, and select the borough to work with:
<hr style="height:1px;border-width:0;color:white;background-color:brown">

<li Style='font-size:110%;color:brown'> 
    Before we get the data and start exploring it, let's get all the libraries that we will need. </li>

In [25]:
# Importing libraries
import json # library to handle JSON files

#!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (requires pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!pip install folium==0.5.0
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


<li Style='font-size:110%;color:brown'> Let's take a detailed look at the boroughs in Toronto. </li>

In [26]:
## Checking the number of Boroughs and Neighborhoods within each. 
print('\nThe Toronto data has {} boroughs and {} neighborhoods.'.format(
    len(Toronto_data_final['Borough'].unique()),Toronto_data_final.shape[0]))

print('\n # of Neighborhoods in each Borough:\n')
print(Toronto_data_final[['Borough','Neighborhood']].groupby('Borough',as_index=False).count(),'\n')

#print(Toronto_data_final[Toronto_data_final['Borough']=='Central Toronto']['Neighborhood'])


The Toronto data has 10 boroughs and 103 neighborhoods.

 # of Neighborhoods in each Borough:

            Borough  Neighborhood
0   Central Toronto             9
1  Downtown Toronto            19
2      East Toronto             5
3         East York             5
4         Etobicoke            12
5       Mississauga             1
6        North York            24
7       Scarborough            17
8      West Toronto             6
9              York             5 



<li Style='font-size:110%;color:brown'> Let's create a map of Toronto with all neighborhoods superimposed on top. </li>
<Ol Style='font-size:100%;line-height:25px'> 
    <li> Get the latitude and longitude values of Toronto City using <b>geopy library</b> </li>
    <li> Define an instance of the geocoder with the user_agent <b>Toronto_explorer</b>.</li>
    <li> Generate a map using <b>folium</b>.</li>
</Ol>

In [27]:
## Get coordinates of Toronto city
address = 'Toronto City, Canada' ## or ('Toronto City, ON')
geolocator = Nominatim(user_agent="Toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

print('\nThe geograpical coordinate of Toronto City are {}, {}.'.format(latitude, longitude),'\n')
## creating a copy of Dataframe as neighborhoods
neighborhoods = Toronto_data_final.copy()
#neighborhoods = Toronto_data_final[Toronto_data_final['Borough']=='Central Toronto'].copy()

## create map of Toronto using latitude and longitude values
map_Toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude']
                                           , neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Toronto)
map_Toronto


The geograpical coordinate of Toronto City are 43.6534817, -79.3839347. 



<li Style='font-size:110%;color:brown'> Next, we will simplify the above map and segment and cluster only the neighborhoods in <b> Central Toronto </b> borough.</li>

<p Style='font-size:100%;'> 
    Let's slice the original dataframe to create a new one for Central Toronto borough.</p>

In [28]:
print('\nNew Dataframe: CentralToronto_df\n')
CentralToronto_df = Toronto_data_final[Toronto_data_final['Borough']=='Central Toronto'].reset_index(drop=True)
CentralToronto_df.iloc[:,1:]


New Dataframe: CentralToronto_df



| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Central Toronto | Lawrence Park | 43.72802 | -79.38879 |
| 1 | Central Toronto | Roselawn | 43.711695 | -79.416936 |
| 2 | Central Toronto | Davisville North | 43.712751 | -79.390197 |
| 3 | Central Toronto | Forest Hill North & West, Forest Hill Road Park | 43.696948 | -79.411307 |
| 4 | Central Toronto | North Toronto West, Lawrence Park | 43.715383 | -79.405678 |
| 5 | Central Toronto | The Annex, North Midtown, Yorkville | 43.67271 | -79.405678 |
| 6 | Central Toronto | Davisville | 43.704324 | -79.38879 |
| 7 | Central Toronto | Moore Park, Summerhill East | 43.689574 | -79.38316 |
| 8 | Central Toronto | Summerhill West, Rathnelly, South Hill, Forest... | 43.686412 | -79.400049 |


<li Style='font-size:110%;color:brown'> Let's create a map of Central Toronto and visualize its neighborhoods, as we did with Toronto City. </li>

In [29]:
## Get coordinates of Central Toronto.
address = 'Central Toronto, Canada'
geolocator = Nominatim(user_agent="Toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

print('\nThe geograpical coordinate of Central Toronto are {}, {}.'.format(latitude, longitude),'\n')
## creating a copy of Dataframe as neighborhoods
neighborhoods = CentralToronto_df.copy()

## Create map of Central Toronto using latitude and longitude values of borough
#map_Toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

## The geocoder returns the same point as for Toronto City, so center the map on one of the borough's neighborhoods for a better view
map_Toronto = folium.Map(location=[43.696948,-79.411307], zoom_start=13) # Forest Hill North coordinates
 
## add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude']
                                           , neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Toronto)
map_Toronto


The geograpical coordinate of Central Toronto are 43.6534817, -79.3839347. 



<li Style='font-size:110%;color:brown'> Next, we are going to start utilizing the <b>Foursquare API</b> to explore the neighborhoods and segment them.</li>

__Define Foursquare Credentials and Version__    
Make sure that you have created a Foursquare developer account and have your credentials handy.

In [30]:
# The code was removed by Watson Studio for sharing.

In [31]:
# CLIENT_ID = 'xxxxxxxxxxxxx' # your Foursquare ID
# CLIENT_SECRET = 'xxxxxxxxxxxxx' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 30
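
Because the credentials cell is hidden, one common way to supply the keys without writing them into the notebook is to read them from environment variables. A minimal sketch follows; the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_credentials` are assumptions chosen for illustration, not names defined by Foursquare or by this notebook.

```python
import os


def load_credentials(env=os.environ):
    """Hypothetical helper: return (client_id, client_secret) or raise if one is missing."""
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Set the environment variable {missing} first") from None


# Demonstration with a stand-in mapping instead of the real environment.
demo_env = {"FOURSQUARE_CLIENT_ID": "xxxx", "FOURSQUARE_CLIENT_SECRET": "yyyy"}
demo_id, demo_secret = load_credentials(demo_env)
print(demo_id, demo_secret)
```

In the notebook itself one would call `CLIENT_ID, CLIENT_SECRET = load_credentials()` after exporting the two variables in the shell that launches Jupyter.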

__Let's explore the first neighborhood in our dataframe.__  
Get the neighborhood's name.

In [32]:
print('\nFirst Neighborhood is',CentralToronto_df.loc[0, 'Neighborhood'],".")


First Neighborhood is Lawrence Park .


__Get the neighborhood's latitude and longitude values.__

In [33]:
neighborhood_latitude = CentralToronto_df.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = CentralToronto_df.loc[0, 'Longitude'] # neighborhood longitude value
neighborhood_name = CentralToronto_df.loc[0, 'Neighborhood'] # neighborhood name

print('\nLatitude and longitude values of {} are {}, {}.'.format
      (neighborhood_name, neighborhood_latitude,neighborhood_longitude)) 


Latitude and longitude values of Lawrence Park are 43.7280205, -79.3887901.


__Now, let's get the top 100 venues that are in Lawrence Park within a radius of 500 meters.__   
First, let's create the GET request URL. Name your URL **url**.

In [34]:
## Define LIMIT query and radius 
radius = 500; LIMIT = 100

## Define the corresponding URL
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(
    CLIENT_ID, CLIENT_SECRET, neighborhood_latitude, neighborhood_longitude, VERSION, radius, LIMIT)
#url
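
As an alternative to formatting the query string by hand, the URL can be assembled from a dictionary so that each parameter is named and escaped automatically. This is a sketch under the assumption that the same endpoint and parameters are used; `build_explore_url` is a hypothetical helper and the `"ID"` and `"SECRET"` values are placeholders.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, lat, lng, version, radius=500, limit=100):
    """Hypothetical helper: build the venues/explore URL from named parameters."""
    base = "https://api.foursquare.com/v2/venues/explore"
    query = urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",  # latitude,longitude pair; the comma is percent-encoded
        "radius": radius,
        "limit": limit,
    })
    return f"{base}?{query}"


# Placeholder credentials with the Lawrence Park coordinates from above.
demo_url = build_explore_url("ID", "SECRET", 43.7280205, -79.3887901, "20180604")
print(demo_url)
```

The same dictionary could also be passed to `requests.get(base, params=...)`, which performs the encoding itself.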

__Send the GET request and examine the results__

In [35]:
results = requests.get(url).json()
#results.keys()
results['response']['groups'][0]['items'][0]

{'reasons': {'count': 0,
  'items': [{'summary': 'This spot is popular',
    'type': 'general',
    'reasonName': 'globalInteractionReason'}]},
 'venue': {'id': '50e6da19e4b0d8a78a0e9794',
  'name': 'Lawrence Park Ravine',
  'location': {'address': '3055 Yonge Street',
   'crossStreet': 'Lawrence Avenue East',
   'lat': 43.72696303913755,
   'lng': -79.39438246708775,
   'labeledLatLngs': [{'label': 'display',
     'lat': 43.72696303913755,
     'lng': -79.39438246708775}],
   'distance': 465,
   'cc': 'CA',
   'city': 'Toronto',
   'state': 'ON',
   'country': 'Canada',
   'formattedAddress': ['3055 Yonge Street (Lawrence Avenue East)',
    'Toronto ON',
    'Canada']},
  'categories': [{'id': '4bf58dd8d48988d163941735',
    'name': 'Park',
    'pluralName': 'Parks',
    'shortName': 'Park',
    'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/parks_outdoors/park_',
     'suffix': '.png'},
    'primary': True}],
  'photos': {'count': 0, 'groups': []}},
 'referralId': 'e-0-50

__As we saw in the Foursquare lab in the previous module, all the information is in the *items* key.__  
*Before we proceed, let's borrow the **get_category_type** function from the Foursquare lab.*

In [36]:
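# function that extracts the category of the venue
def get_category_type(row):
    # the category list lives under 'categories' or, after json_normalize, 'venue.categories'
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # return the name of the first category, or None if the venue has none
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']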
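
To see what this function does on concrete input, here is a small sketch that restates the same lookup for plain dictionaries and runs it on two hand-written rows. The name `first_category_name` and both sample rows are illustrative assumptions; in the notebook the function receives pandas rows instead.

```python
def first_category_name(row):
    """Illustrative restatement: name of the first category, or None if there is none."""
    categories = row.get("categories") or row.get("venue.categories") or []
    return categories[0]["name"] if categories else None


# Hand-written rows shaped like the flattened Foursquare items.
sample_row = {
    "venue.name": "Lawrence Park Ravine",
    "venue.categories": [{"name": "Park", "shortName": "Park"}],
}
empty_row = {"venue.name": "Venue without a category", "venue.categories": []}

print(first_category_name(sample_row))
print(first_category_name(empty_row))
```

The second call shows why the empty-list check matters: a venue without categories yields `None` instead of raising an error.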

__Now we are ready to clean the json and structure it into a *pandas* dataframe.__

In [37]:
## Get Venues detail
venues = results['response']['groups'][0]['items']  
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

# View nearby venues
print('\nNearby_venues dataframe for venues of Lawrence Park:\n')
nearby_venues.head()


Nearby_venues dataframe for venues of Lawrence Park:



| | name | categories | lat | lng |
|---|---|---|---|---|
| 0 | Lawrence Park Ravine | Park | 43.726963 | -79.394382 |
| 1 | Zodiac Swim School | Swim School | 43.728532 | -79.38286 |
| 2 | TTC Bus #162 - Lawrence-Donway | Bus Line | 43.728026 | -79.382805 |


__How many venues were returned by Foursquare?__

In [38]:
print('\n# {} venues were returned by Foursquare for {} .'.format(nearby_venues.shape[0],neighborhood_name))


# 3 venues were returned by Foursquare for Lawrence Park .


### 2. Explore Neighborhoods of Central Toronto:
<hr style="height:1px;border-width:0;color:white;background-color:brown">

<li Style='font-size:110%;color:brown'> 
    Let's create a function to repeat the same process for all the neighborhoods in Central Toronto. </li>

In [39]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    """Return a dataframe of Foursquare venues within `radius` meters of each neighborhood."""
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([
            (name, lat, lng, v['venue']['name'], v['venue']['location']['lat'],
              v['venue']['location']['lng'], v['venue']['categories'][0]['name']) 
            for v in results])
    
    # flatten the per-neighborhood lists into a single dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude',
                             'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']
    print('New dataframe for nearby venues is created.')
    return nearby_venues
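
One fragile spot in this function is `v['venue']['categories'][0]`, which would raise an `IndexError` for a venue that has no category. The sketch below isolates the flattening step in a helper that tolerates an empty category list and can be tried without calling the API. `flatten_items` and the second sample venue are hypothetical; the first sample reuses values from the response shown earlier.

```python
def flatten_items(name, lat, lng, items):
    """Hypothetical helper: turn Foursquare 'items' into flat tuples, one per venue."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # Fall back to None when the venue has no category at all.
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows


sample_items = [
    {"venue": {"name": "Lawrence Park Ravine",
               "location": {"lat": 43.72696303913755, "lng": -79.39438246708775},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Uncategorized place",  # made-up venue with no category
               "location": {"lat": 43.0, "lng": -79.0},
               "categories": []}},
]

rows = flatten_items("Lawrence Park", 43.7280205, -79.3887901, sample_items)
for row in rows:
    print(row)
```

Separating the parsing from the HTTP request in this way also makes it easier to test the logic on saved responses.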

<li Style='font-size:110%;color:brown'> 
    Now, let's use the above function on each neighborhood and create a new dataframe called <b>CenToront_venues</b>.</li>

In [40]:
## Create new Dataframe using getNearbyVenues function
CenToront_venues = getNearbyVenues( names=CentralToronto_df['Neighborhood'],
                                    latitudes=CentralToronto_df['Latitude'],
                                    longitudes=CentralToronto_df['Longitude']
                                  )
## Print size of the resulting dataframe
print('This dataframe of Venues has {} rows and {} columns:'.format(CenToront_venues.shape[0]
                                                                    ,CenToront_venues.shape[1]))
## Take a look at the dataframe
CenToront_venues.head()

New dataframe for nearby venues is created.
This dataframe of Venues has 110 rows and 7 columns:


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Lawrence Park | 43.72802 | -79.38879 | Lawrence Park Ravine | 43.726963 | -79.394382 | Park |
| 1 | Lawrence Park | 43.72802 | -79.38879 | Zodiac Swim School | 43.728532 | -79.38286 | Swim School |
| 2 | Lawrence Park | 43.72802 | -79.38879 | TTC Bus #162 - Lawrence-Donway | 43.728026 | -79.382805 | Bus Line |
| 3 | Roselawn | 43.711695 | -79.416936 | Ceiling Champions | 43.713891 | -79.420702 | Home Service |
| 4 | Roselawn | 43.711695 | -79.416936 | Rosalind's Garden Oasis | 43.712189 | -79.411978 | Garden |


__Let's check how many venues were returned for each neighborhood__

In [41]:
print('# of Venues in each Neighborhood:')
CenToront_venues.groupby('Neighborhood').count()[['Venue']].sort_values(by='Venue',ascending=False)

# of Venues in each Neighborhood:


| Neighborhood | Venue |
|---|---|
| Davisville | 31 |
| The Annex, North Midtown, Yorkville | 22 |
| North Toronto West, Lawrence Park | 19 |
| Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park | 17 |
| Davisville North | 9 |
| Forest Hill North & West, Forest Hill Road Park | 4 |
| Lawrence Park | 3 |
| Roselawn | 3 |
| Moore Park, Summerhill East | 2 |


__Let's find out how many unique categories can be curated from all the returned venues__

In [42]:
print('\nThere are {} uniques categories of venues.'.format(len(CenToront_venues['Venue Category'].unique())))


There are 61 uniques categories of venues.


### 3. Analyze Each Neighborhood:
<hr style="height:1px;border-width:0;color:white;background-color:brown">

<li Style='font-size:110%;color:brown'> 
    Now that we have the venue details for each neighborhood, we will analyze them so that we can cluster the neighborhoods later on.</li>

__Create a dataframe that summarizes the venue categories in each Neighborhood.__  
Apply the pandas get_dummies() function, i.e. the one-hot encoding technique.

In [43]:
# one hot encoding
CenToront_onehot = pd.get_dummies(CenToront_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
CenToront_onehot['Neighborhood'] = CenToront_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [CenToront_onehot.columns[-1]] + list(CenToront_onehot.columns[:-1])
CenToront_onehot = CenToront_onehot[fixed_columns]

## print size of dataframe
print('\nNew dataframe has {} rows and {} columns:'.format(CenToront_onehot.shape[0]
                                                         ,CenToront_onehot.shape[1]),'\n')
## Take a look at the dataframe
CenToront_onehot.head()
#CenToront_onehot[list(CenToront_onehot.columns[:15])+ [CenToront_onehot.columns[-1]]]


New dataframe has 110 rows and 62 columns: 



Unnamed: 0,Neighborhood,American Restaurant,BBQ Joint,Bagel Shop,Bank,Breakfast Spot,Brewery,Burger Joint,Bus Line,Café,Chinese Restaurant,Clothing Store,Coffee Shop,Convenience Store,Department Store,Dessert Shop,Diner,Donut Shop,Farmers Market,Fast Food Restaurant,Food & Drink Shop,Fried Chicken Joint,Garden,Gas Station,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,History Museum,Home Service,Hotel,Ice Cream Shop,Indian Restaurant,Italian Restaurant,Jewelry Store,Light Rail Station,Liquor Store,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Park,Pharmacy,Pizza Place,Pub,Rental Car Location,Restaurant,Salon / Barbershop,Sandwich Place,Seafood Restaurant,Spa,Sporting Goods Shop,Sports Bar,Supermarket,Sushi Restaurant,Swim School,Thai Restaurant,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Yoga Studio
0,Lawrence Park,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Lawrence Park,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
2,Lawrence Park,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Roselawn,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Roselawn,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
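
To make the encoding step concrete, here is a minimal sketch on a made-up three-row table. The names `toy_venues` and `toy_onehot` and the data are assumptions for illustration only; they are not part of the notebook's data.

```python
import pandas as pd

# Hypothetical toy data standing in for CenToront_venues
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'B'],
    'Venue Category': ['Park', 'Gym', 'Park'],
})

# One indicator column per category; the empty prefix keeps the bare category names
toy_onehot = pd.get_dummies(toy_venues[['Venue Category']], prefix="", prefix_sep="")

# Each venue has exactly one category, so every row has exactly one active indicator
indicators_per_row = toy_onehot.sum(axis=1).tolist()
print(toy_onehot.astype(int))
print(indicators_per_row)
```

The same property holds for the real dataframe: every row is one venue, so each row of the one-hot table contains a single 1.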


__Next, let's find out the mean frequency of occurrence of each category.__   
Group the rows by neighborhood and take the mean of each one-hot column; the result is the share of a neighborhood's venues that fall into each category.

In [44]:
CenToront_grouped = CenToront_onehot.groupby('Neighborhood').mean().reset_index()
print('mean values dataframe:')
CenToront_grouped.head() ##[CenToront_grouped.columns[0:10]]

mean values dataframe:


Unnamed: 0,Neighborhood,American Restaurant,BBQ Joint,Bagel Shop,Bank,Breakfast Spot,Brewery,Burger Joint,Bus Line,Café,Chinese Restaurant,Clothing Store,Coffee Shop,Convenience Store,Department Store,Dessert Shop,Diner,Donut Shop,Farmers Market,Fast Food Restaurant,Food & Drink Shop,Fried Chicken Joint,Garden,Gas Station,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,History Museum,Home Service,Hotel,Ice Cream Shop,Indian Restaurant,Italian Restaurant,Jewelry Store,Light Rail Station,Liquor Store,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Park,Pharmacy,Pizza Place,Pub,Rental Car Location,Restaurant,Salon / Barbershop,Sandwich Place,Seafood Restaurant,Spa,Sporting Goods Shop,Sports Bar,Supermarket,Sushi Restaurant,Swim School,Thai Restaurant,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Yoga Studio
0,Davisville,0.0,0.0,0.0,0.0,0.0,0.032258,0.0,0.0,0.064516,0.0,0.0,0.064516,0.0,0.0,0.096774,0.032258,0.0,0.032258,0.0,0.0,0.0,0.0,0.032258,0.032258,0.032258,0.0,0.064516,0.0,0.0,0.0,0.0,0.0,0.032258,0.064516,0.0,0.0,0.0,0.0,0.0,0.0,0.032258,0.032258,0.064516,0.0,0.0,0.032258,0.0,0.096774,0.032258,0.0,0.0,0.0,0.0,0.064516,0.0,0.032258,0.032258,0.0,0.0,0.0,0.0
1,Davisville North,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.111111,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.111111,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Forest Hill North & West, Forest Hill Road Park",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.25,0.0,0.0,0.0
3,Lawrence Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0
4,"Moore Park, Summerhill East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
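
Why the mean of one-hot columns is a frequency can be seen in a small sketch. The table `toy_onehot` below is invented for illustration and is not the notebook's data.

```python
import pandas as pd

# Hypothetical one-hot table: neighborhood A has three venues, B has one
toy_onehot = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Gym':  [1, 0, 0, 0],
    'Park': [0, 1, 1, 1],
})

# Mean of a 0/1 column within a group = fraction of that group's venues in the category
toy_grouped = toy_onehot.groupby('Neighborhood').mean().reset_index()
print(toy_grouped)

# The frequencies of each neighborhood add up to 1
row_sums = toy_grouped.drop(columns='Neighborhood').sum(axis=1)
print(row_sums.tolist())
```

This also explains values such as 0.333333 for Lawrence Park above: three venues, each in a different category.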


__Let's confirm the new size__

In [45]:
## print size of dataframe
tname=CenToront_grouped.shape
print('New dataframe has {} rows and {} columns:'.format(tname[0],tname[1]),'\n')

New dataframe has 9 rows and 62 columns: 



__Let's print each neighborhood along with the top 5 most common venues__

In [46]:
## No of top venues to be printed 
num_top_venues = 5
## for loop to evaluate and print each neighborhood
for hood in CenToront_grouped['Neighborhood']:
    print('---',hood,'---')
    temp = CenToront_grouped[CenToront_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues),'\n')

--- Davisville ---
                venue  freq
0        Dessert Shop  0.10
1      Sandwich Place  0.10
2  Italian Restaurant  0.06
3                 Gym  0.06
4                Café  0.06 

--- Davisville North ---
                  venue  freq
0                 Hotel  0.11
1      Department Store  0.11
2  Gym / Fitness Center  0.11
3                   Gym  0.11
4                  Park  0.11 

--- Forest Hill North & West, Forest Hill Road Park ---
                 venue  freq
0        Jewelry Store  0.25
1                Trail  0.25
2   Mexican Restaurant  0.25
3     Sushi Restaurant  0.25
4  American Restaurant  0.00 

--- Lawrence Park ---
                 venue  freq
0             Bus Line  0.33
1          Swim School  0.33
2                 Park  0.33
3  American Restaurant  0.00
4           Restaurant  0.00 

--- Moore Park, Summerhill East ---
                 venue  freq
0                  Gym   0.5
1                 Park   0.5
2  American Restaurant   0.0
3           Restaurant

__Let's put that into a *pandas* dataframe__   
First, let's write a function to sort the venues in descending order.
<!--  CenToront_grouped.iloc[0, :][1:].sort_values(ascending=False) -->

In [47]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
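
As a quick check, the helper can be tried on a single hand-made row. The function is repeated here so the sketch runs on its own; `toy_row` and its values are assumptions for illustration.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # Skip the first entry (the neighborhood name) and rank the category frequencies
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Hypothetical row shaped like one row of CenToront_grouped
toy_row = pd.Series({'Neighborhood': 'A', 'Gym': 0.2, 'Park': 0.5, 'Pub': 0.3})

top2 = list(return_most_common_venues(toy_row, 2))
print(top2)
```

Two things are worth keeping in mind when reading the ranked table: categories with equal frequency come out in no guaranteed order, because `sort_values` does not use a stable sort by default, and a neighborhood with fewer distinct categories than `num_top_venues` has its remaining slots filled with zero-frequency categories.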

__Now let's create the new dataframe and display the top 10 venues for each neighborhood.__

In [48]:
num_top_venues = 10
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # no 'st'/'nd'/'rd' entry left, fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = CenToront_grouped['Neighborhood']

for ind in np.arange(CenToront_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(CenToront_grouped.iloc[ind, :], num_top_venues)
    ## **_sorted.iloc[ind, 1:] = **_grouped.apply(lambda x : x[1:].sort_values(ascending=False).index.values[0:num_top_venues], axis=1)[ind]
    
# print dataframe
print('New dataframe has {} Neighborhoods and top {} Venues listed for each:'.format(neighborhoods_venues_sorted.shape[0]
                                                                                     ,neighborhoods_venues_sorted.shape[1]-1),'\n')
neighborhoods_venues_sorted.head()

New dataframe has 9 Neighborhoods and top 10 Venues listed for each: 



Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Davisville,Dessert Shop,Sandwich Place,Pizza Place,Gym,Italian Restaurant,Sushi Restaurant,Coffee Shop,Café,Gourmet Shop,Gas Station
1,Davisville North,Hotel,Sandwich Place,Gym,Food & Drink Shop,Park,Pizza Place,Department Store,Gym / Fitness Center,Breakfast Spot,Bus Line
2,"Forest Hill North & West, Forest Hill Road Park",Jewelry Store,Trail,Sushi Restaurant,Mexican Restaurant,Fried Chicken Joint,Donut Shop,Farmers Market,Fast Food Restaurant,Food & Drink Shop,Yoga Studio
3,Lawrence Park,Swim School,Bus Line,Park,Yoga Studio,Diner,Gym,Grocery Store,Greek Restaurant,Gourmet Shop,Gas Station
4,"Moore Park, Summerhill East",Gym,Park,Yoga Studio,Diner,Gym / Fitness Center,Grocery Store,Greek Restaurant,Gourmet Shop,Gas Station,Garden
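
The column labels above are built with a try/except around a three-item suffix list, which is fine for 10 venues but would label the 21st column "21th". A possible alternative is sketched below; the helper name `ordinal` is hypothetical and not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th, 13th (and 111th, ...) are exceptions to the last-digit rule
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

num_top_venues = 10
columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i))
                              for i in range(1, num_top_venues + 1)]
print(columns)
```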


### 4. Cluster Neighborhoods
<hr style="height:1px;border-width:0;color:white;background-color:brown">

<li Style='font-size:110%;color:brown'> 
    Let's run <em>k</em>-means to cluster the neighborhood into 4 clusters. </li>

In [49]:
# set number of clusters
kclusters = 4

# keep only the numeric frequency columns (keyword form also works on pandas 2.x)
CenToront_grouped_clustering = CenToront_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(CenToront_grouped_clustering)

# check cluster labels generated for each row in the dataframe
print('Cluster Labels:',kmeans.labels_[0:])

Cluster Labels: [0 0 0 2 1 0 3 0 0]
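
To see what the labels mean, here is a minimal sketch of the same call on a tiny invented frequency matrix. `toy_freqs` is an assumption for illustration, and `n_init` is set explicitly only so the sketch behaves the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical frequency rows: the first two are dominated by category 0,
# the last two by category 2
toy_freqs = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.0, 0.2, 0.8],
])

toy_kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(toy_freqs)
labels = toy_kmeans.labels_

# Rows with similar frequency profiles receive the same label
print(labels)
```

The label numbers themselves are arbitrary identifiers; only which rows share a label matters. In the output above, six of the nine neighborhoods share label 0, while labels 1, 2 and 3 each hold a single neighborhood.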


In [50]:
## View the dataframe that was passed to k-means
print('Cluster dataframe output:')
CenToront_grouped_clustering.head()

Cluster dataframe output:


Unnamed: 0,American Restaurant,BBQ Joint,Bagel Shop,Bank,Breakfast Spot,Brewery,Burger Joint,Bus Line,Café,Chinese Restaurant,Clothing Store,Coffee Shop,Convenience Store,Department Store,Dessert Shop,Diner,Donut Shop,Farmers Market,Fast Food Restaurant,Food & Drink Shop,Fried Chicken Joint,Garden,Gas Station,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,History Museum,Home Service,Hotel,Ice Cream Shop,Indian Restaurant,Italian Restaurant,Jewelry Store,Light Rail Station,Liquor Store,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Park,Pharmacy,Pizza Place,Pub,Rental Car Location,Restaurant,Salon / Barbershop,Sandwich Place,Seafood Restaurant,Spa,Sporting Goods Shop,Sports Bar,Supermarket,Sushi Restaurant,Swim School,Thai Restaurant,Toy / Game Store,Trail,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Yoga Studio
0,0.0,0.0,0.0,0.0,0.0,0.032258,0.0,0.0,0.064516,0.0,0.0,0.064516,0.0,0.0,0.096774,0.032258,0.0,0.032258,0.0,0.0,0.0,0.0,0.032258,0.032258,0.032258,0.0,0.064516,0.0,0.0,0.0,0.0,0.0,0.032258,0.064516,0.0,0.0,0.0,0.0,0.0,0.0,0.032258,0.032258,0.064516,0.0,0.0,0.032258,0.0,0.096774,0.032258,0.0,0.0,0.0,0.0,0.064516,0.0,0.032258,0.032258,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.111111,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.111111,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.25,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


<li Style='font-size:110%;color:brown'> 
    Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.</li>


In [51]:
# Add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

CenToront_merged = CentralToronto_df

# join the sorted venues (with cluster labels) onto CentralToronto_df, which holds latitude/longitude for each neighborhood
CenToront_merged = CenToront_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

## view final output
print('\nFinal dataframe:')
CenToront_merged.head() # check the last columns of original df!


Final dataframe:


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,2,Swim School,Bus Line,Park,Yoga Studio,Diner,Gym,Grocery Store,Greek Restaurant,Gourmet Shop,Gas Station
1,M5N,Central Toronto,Roselawn,43.711695,-79.416936,3,Ice Cream Shop,Garden,Home Service,Yoga Studio,Diner,Gym / Fitness Center,Gym,Grocery Store,Greek Restaurant,Gourmet Shop
2,M4P,Central Toronto,Davisville North,43.712751,-79.390197,0,Hotel,Sandwich Place,Gym,Food & Drink Shop,Park,Pizza Place,Department Store,Gym / Fitness Center,Breakfast Spot,Bus Line
3,M5P,Central Toronto,"Forest Hill North & West, Forest Hill Road Park",43.696948,-79.411307,0,Jewelry Store,Trail,Sushi Restaurant,Mexican Restaurant,Fried Chicken Joint,Donut Shop,Farmers Market,Fast Food Restaurant,Food & Drink Shop,Yoga Studio
4,M4R,Central Toronto,"North Toronto West, Lawrence Park",43.715383,-79.405678,0,Clothing Store,Coffee Shop,Chinese Restaurant,Gym / Fitness Center,Fast Food Restaurant,Diner,Mexican Restaurant,Miscellaneous Shop,Park,Rental Car Location
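
The join used above keeps every row of the left dataframe and looks up the right one by neighborhood name. A small sketch with invented data (`toy_locations` and `toy_sorted` are assumptions, not the notebook's dataframes) shows the behavior, including what happens to a neighborhood that has no match:

```python
import pandas as pd

# Hypothetical location table (left side of the join)
toy_locations = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C'],
    'Latitude': [43.70, 43.71, 43.72],
})

# Hypothetical ranked-venue table with cluster labels; 'C' is missing on purpose
toy_sorted = pd.DataFrame({
    'Cluster Labels': [1, 0],
    'Neighborhood': ['B', 'A'],
    '1st Most Common Venue': ['Park', 'Gym'],
})

# Left join: row order follows toy_locations, unmatched rows get NaN
toy_merged = toy_locations.join(toy_sorted.set_index('Neighborhood'), on='Neighborhood')
print(toy_merged)
```

In this notebook all nine neighborhoods have venues, so no NaN appears. If a neighborhood had returned no venues, its cluster label would be NaN and the label column would become float, which would need handling before the labels are used as list indices for the map colors.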


<li Style='font-size:110%;color:brown'> Finally, let's visualize the resulting clusters.</li>

In [52]:
# create map
#map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)
map_clusters = folium.Map(location=[43.696948,-79.411307], zoom_start=13) # Forest Hill North - Coordinates

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
#rainbow = [colors.rgb2hex(i) for i in colors_array] ; print(colors_array,rainbow)
rainbow = ['#8000ff', '#983D3D','#232066', '#ff0000'] # ['#00b5eb']

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(CenToront_merged['Latitude'], CenToront_merged['Longitude']
                                  , CenToront_merged['Neighborhood'], CenToront_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=8,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
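
The map cell hard-codes four hex colors and leaves the colormap-based line commented out. As a sketch of that alternative, the palette can be generated for any number of clusters; this assumes matplotlib is installed and that `cm` and `colors` refer to `matplotlib.cm` and `matplotlib.colors`.

```python
import numpy as np
from matplotlib import cm, colors

kclusters = 4

# Sample the rainbow colormap at evenly spaced points, one per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))

# Convert each RGBA tuple to a hex string that folium accepts
rainbow = [colors.rgb2hex(c) for c in colors_array]

# Map each cluster label directly to its color
color_for_label = {label: rainbow[label] for label in range(kclusters)}
print(color_for_label)
```

Note that the cell above indexes `rainbow[cluster-1]`, so label 0 picks the last color through Python's negative indexing. Each cluster still gets a distinct color; indexing by the label itself, as in this sketch, is simply easier to read.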

<a id='item5'></a>

### 5. Examine Clusters
<hr style="height:1px;border-width:0;color:white;background-color:brown">

<li Style='font-size:110%;color:brown'> 
    Now we can examine each cluster and identify the venue categories that distinguish it.</li>

#### Cluster 1

In [53]:
CenToront_merged.loc[CenToront_merged['Cluster Labels'] == 0, CenToront_merged.columns[[2] + list(range(6, CenToront_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Davisville North,Hotel,Sandwich Place,Gym,Food & Drink Shop,Park,Pizza Place,Department Store,Gym / Fitness Center,Breakfast Spot,Bus Line
3,"Forest Hill North & West, Forest Hill Road Park",Jewelry Store,Trail,Sushi Restaurant,Mexican Restaurant,Fried Chicken Joint,Donut Shop,Farmers Market,Fast Food Restaurant,Food & Drink Shop,Yoga Studio
4,"North Toronto West, Lawrence Park",Clothing Store,Coffee Shop,Chinese Restaurant,Gym / Fitness Center,Fast Food Restaurant,Diner,Mexican Restaurant,Miscellaneous Shop,Park,Rental Car Location
5,"The Annex, North Midtown, Yorkville",Sandwich Place,Café,Coffee Shop,Pizza Place,History Museum,Indian Restaurant,Donut Shop,Liquor Store,Middle Eastern Restaurant,Park
6,Davisville,Dessert Shop,Sandwich Place,Pizza Place,Gym,Italian Restaurant,Sushi Restaurant,Coffee Shop,Café,Gourmet Shop,Gas Station
8,"Summerhill West, Rathnelly, South Hill, Forest...",Pub,Light Rail Station,Coffee Shop,Sports Bar,Vietnamese Restaurant,Fried Chicken Joint,Liquor Store,Pizza Place,Restaurant,American Restaurant


#### Cluster 2

In [54]:
CenToront_merged.loc[CenToront_merged['Cluster Labels'] == 1, CenToront_merged.columns[[2] + list(range(6, CenToront_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
7,"Moore Park, Summerhill East",Gym,Park,Yoga Studio,Diner,Gym / Fitness Center,Grocery Store,Greek Restaurant,Gourmet Shop,Gas Station,Garden


#### Cluster 3

In [55]:
CenToront_merged.loc[CenToront_merged['Cluster Labels'] == 2, CenToront_merged.columns[[2] + list(range(6, CenToront_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Lawrence Park,Swim School,Bus Line,Park,Yoga Studio,Diner,Gym,Grocery Store,Greek Restaurant,Gourmet Shop,Gas Station


#### Cluster 4

In [56]:
CenToront_merged.loc[CenToront_merged['Cluster Labels'] == 3, CenToront_merged.columns[[2] + list(range(6, CenToront_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Roselawn,Ice Cream Shop,Garden,Home Service,Yoga Studio,Diner,Gym / Fitness Center,Gym,Grocery Store,Greek Restaurant,Gourmet Shop
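
Besides reading the ranked tables, one way to find the categories that set a cluster apart is to average the frequency table per cluster. The sketch below uses invented data; `toy_grouped`, `toy_labels` and `profile` are assumptions for illustration, and in the notebook the corresponding inputs would be `CenToront_grouped` and `kmeans.labels_`.

```python
import pandas as pd

# Hypothetical frequency table for three neighborhoods
toy_grouped = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C'],
    'Gym':  [0.5, 0.4, 0.0],
    'Park': [0.5, 0.1, 0.2],
    'Pub':  [0.0, 0.5, 0.8],
})
# Hypothetical cluster labels, one per neighborhood
toy_labels = [0, 0, 1]

# Average category frequency within each cluster
profile = (toy_grouped.drop(columns='Neighborhood')
           .assign(cluster=toy_labels)
           .groupby('cluster').mean())

# The category with the highest average frequency in each cluster
top_per_cluster = profile.idxmax(axis=1)
print(profile)
print(top_per_cluster)
```

For clusters that contain a single neighborhood, as three of the four clusters here do, the profile is just that neighborhood's own frequency row.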


__Note: To analyze a cluster and its distinguishing venue categories, the cell below prints the venues of a neighborhood in that cluster.__

In [57]:
print('\nVenues in cluster: 4')
clust = CenToront_merged[CenToront_merged['Cluster Labels'] == 3]['Neighborhood'].tolist()
CenToront_venues[CenToront_venues['Neighborhood']==clust[0]][['Neighborhood','Venue','Venue Category']]


Venues in cluster: 4


Unnamed: 0,Neighborhood,Venue,Venue Category
3,Roselawn,Ceiling Champions,Home Service
4,Roselawn,Rosalind's Garden Oasis,Garden
5,Roselawn,Menchie's St. Clair West,Ice Cream Shop


<h6 style='text-align:center'> End of PART III </h6>
<hr style="height:2px;border-width:0;color:white;background-color:grey">