<H1>Task 3: Exploring and clustering the neighborhoods in Toronto.</H1>

In this task, we use the Foursquare API to explore neighborhoods in selected boroughs of Toronto. The Foursquare explore endpoint returns the most common venue categories in each neighborhood, and we use these category frequencies as features to group the neighborhoods into clusters with the k-means algorithm. We also use Folium to visualize the neighborhoods and their clusters.

In [1]:
#!conda install -c conda-forge geopy --yes        # if needed
#!conda install -c conda-forge folium=0.5.0 --yes # if needed

import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
GeoLocator = Nominatim(user_agent='My-IBMNotebook')

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans
!conda install -c conda-forge folium=0.5.0 --yes # comment out this line if folium is already installed
import folium # map rendering library


Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



Use the geopy library to get the latitude and longitude of Toronto, Canada.

In [2]:
address = "Toronto, ON"

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))


The geographical coordinates of Toronto are 43.6534817, -79.3839347.


Create a map of Toronto with the neighborhoods superimposed on top.

In [3]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)
map_toronto


In [4]:
toronto_task2_csv = "Toronto.TASK_II_df.csv"

In [5]:
toronto_neighborhoods = pd.read_csv(toronto_task2_csv)

In [6]:
toronto_neighborhoods.shape

(9, 5)

In [7]:
toronto_neighborhoods.head()

Unnamed: 0,Postal Code,Latitude,Longitude,Borough,Neighborhood
0,M1B,43.806686,-79.194353,Scarborough,Etobicoke
1,M1C,43.784535,-79.160497,Scarborough,Etobicoke
2,M1E,43.763573,-79.188711,Scarborough,Not assigned
3,M1G,43.770992,-79.216917,Scarborough,Not assigned
4,M1H,43.773136,-79.239476,North York,Not assigned


Add markers to the map.

In [8]:
for lat, lng, borough, neighborhood in zip(toronto_neighborhoods['Latitude'], toronto_neighborhoods['Longitude'], toronto_neighborhoods['Borough'], toronto_neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=4,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#87cefa',
        fill_opacity=0.5,
        parse_html=False).add_to(map_toronto)
map_toronto


For this task, I will limit the analysis to neighborhoods in East, West, and Central Toronto.

In [9]:
toronto_data = toronto_neighborhoods[toronto_neighborhoods['Borough'].str.contains("Toronto")].reset_index(drop=True)
print(toronto_data.shape)


(0, 5)
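
To see how this filter behaves, here is a minimal sketch on made-up borough names (the toy values are illustrative and not taken from the dataset). `str.contains` does a case-sensitive substring match, so only names that literally include "Toronto" survive, and a pattern that matches nothing gives an empty frame.

```python
import pandas as pd

# Toy data: illustrative borough names only, not the notebook's dataset.
toy = pd.DataFrame({
    'Borough': ['Scarborough', 'North York', 'East Toronto', 'Downtown Toronto'],
    'Neighborhood': ['A', 'B', 'C', 'D'],
})

# Keep rows whose Borough contains the substring "Toronto" (case-sensitive).
mask = toy['Borough'].str.contains('Toronto')
toronto_only = toy[mask].reset_index(drop=True)

# A pattern that matches no borough leaves zero rows but keeps the columns.
no_match = toy[toy['Borough'].str.contains('Etobicoke')]

print(toronto_only)
print(no_match.shape)
```

An empty result like this usually means the column values differ from what the filter expects, so it is worth inspecting the unique values of the column before filtering.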


In [10]:
toronto_neighborhoods.shape
toronto_neighborhoods.head()

Unnamed: 0,Postal Code,Latitude,Longitude,Borough,Neighborhood
0,M1B,43.806686,-79.194353,Scarborough,Etobicoke
1,M1C,43.784535,-79.160497,Scarborough,Etobicoke
2,M1E,43.763573,-79.188711,Scarborough,Not assigned
3,M1G,43.770992,-79.216917,Scarborough,Not assigned
4,M1H,43.773136,-79.239476,North York,Not assigned


The assignment restricts the analysis to boroughs whose names contain the word "Toronto". No borough in this data contains that term, which is why the filter above returned zero rows. We therefore build a new dataframe from the existing values, with "Toronto" in the Borough column.

In [11]:
toronto_neighborhoods = pd.DataFrame({'Postal Code':['M1B', 'M1C', 'M1E', 'M1G', 'M1H'],
                                      'Latitude':[43.806686,43.784535,43.763573,43.770992,43.773136],
                                      'Longitude':[-79.194353,-79.160497,-79.188711,-79.216917,-79.239476],
                                      'Borough':['North Toronto', 'North Toronto', 'North Toronto', 'North Toronto', 'North Toronto'],
                                      'Neighborhood':['Etobicoke', 'Etobicoke', 'Etobicoke', 'Not assigned', 'Not assigned'],
                                     })
toronto_neighborhoods.head()


Unnamed: 0,Postal Code,Latitude,Longitude,Borough,Neighborhood
0,M1B,43.806686,-79.194353,North Toronto,Etobicoke
1,M1C,43.784535,-79.160497,North Toronto,Etobicoke
2,M1E,43.763573,-79.188711,North Toronto,Etobicoke
3,M1G,43.770992,-79.216917,North Toronto,Not assigned
4,M1H,43.773136,-79.239476,North Toronto,Not assigned


We now repeat the filtering step on the new dataframe.

In [14]:
toronto_data = toronto_neighborhoods[toronto_neighborhoods['Borough'].str.contains("Toronto")].reset_index(drop=True)
print(toronto_data.shape)


(5, 5)


In [15]:
toronto_data.head()

Unnamed: 0,Postal Code,Latitude,Longitude,Borough,Neighborhood
0,M1B,43.806686,-79.194353,North Toronto,Etobicoke
1,M1C,43.784535,-79.160497,North Toronto,Etobicoke
2,M1E,43.763573,-79.188711,North Toronto,Etobicoke
3,M1G,43.770992,-79.216917,North Toronto,Not assigned
4,M1H,43.773136,-79.239476,North Toronto,Not assigned


We re-create the map with markers for the Toronto neighborhoods, using the same coordinates.

In [16]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)
for lat, lng, label in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

<H2>Utilizing the Foursquare API to explore and segment neighborhoods.</H2>

In [53]:
CLIENT_ID = 'portion hidden from view' # your Foursquare ID
CLIENT_SECRET = 'portion hidden from view' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 30
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)


Your credentials:
CLIENT_ID: portion hidden from view
CLIENT_SECRET: portion hidden from view


Explore the first neighborhood in our data frame "toronto_data".

In [18]:
neighborhood_name = toronto_data.loc[0, 'Neighborhood']
print(f"The first neighborhood's name is '{neighborhood_name}'.")

The first neighborhood's name is 'Etobicoke'.


Get the neighborhood's latitude and longitude values.

In [19]:
neighborhood_latitude = toronto_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = toronto_data.loc[0, 'Longitude'] # neighborhood longitude value

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))


Latitude and longitude values of Etobicoke are 43.806686, -79.194353.


Now, let's get the top 100 venues that are in Etobicoke within a radius of 500 meters.

In [20]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # search radius in meters
url = 'https://api.foursquare.com/v2/venues/explore'
# build the query from the variables defined above instead of hard-coded values
params = dict(
    client_id=CLIENT_ID,
    client_secret=CLIENT_SECRET,
    v=VERSION,
    ll='{},{}'.format(neighborhood_latitude, neighborhood_longitude),
    radius=radius,
    limit=LIMIT
)
resp = requests.get(url=url, params=params)
data = resp.json() # parse the JSON body of the response
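
As a small sketch of how the request parameters turn into a query string, the example here uses only the standard library, so it runs without network access. The credential strings are placeholders, and the variable names are illustrative assumptions rather than the notebook's own.

```python
from urllib.parse import urlencode

# Placeholder credentials: hypothetical values for illustration only.
client_id = 'YOUR_CLIENT_ID'
client_secret = 'YOUR_CLIENT_SECRET'

base_url = 'https://api.foursquare.com/v2/venues/explore'
params = {
    'client_id': client_id,
    'client_secret': client_secret,
    'v': '20180604',               # API version date used in this notebook
    'll': '43.806686,-79.194353',  # latitude,longitude of the first neighborhood
    'radius': 500,                 # search radius in meters
    'limit': 100,                  # maximum number of venues to return
}

# urlencode percent-encodes values, so the comma in `ll` becomes %2C.
full_url = base_url + '?' + urlencode(params)
print(full_url)
```

Passing the same dictionary as `params=` to `requests.get` should produce an equivalent query string, which avoids assembling the URL by hand.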

Define a function that extracts the category of the venue.

In [24]:
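def get_category_type(row):
    # the category list is stored under 'categories' or 'venue.categories',
    # depending on how the JSON was flattened
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # a venue may have no category at all
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']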

Now we are ready to clean the JSON and structure it into a pandas dataframe.

In [26]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list=[]
    
    for name, lat, lng in zip(names, latitudes, longitudes):
        # print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
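
The list comprehension above assumes that every venue has at least one category and that the response always contains a `groups` entry. A hedged sketch of a more defensive extraction follows, run on a hand-made payload that only mimics the response shape accessed above; `extract_venues` is a hypothetical helper, not part of the notebook.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style payload.

    Hypothetical helper: missing groups yield an empty list, and venues
    without categories get None instead of raising IndexError.
    """
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        category = categories[0]['name'] if categories else None
        rows.append((venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


# Hand-made sample mimicking the structure accessed above; not real API output.
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Venue A', 'location': {'lat': 43.80, 'lng': -79.19},
               'categories': [{'name': 'Bank'}]}},
    {'venue': {'name': 'Venue B', 'location': {'lat': 43.81, 'lng': -79.20},
               'categories': []}},
]}]}}

rows = extract_venues(sample)
empty = extract_venues({'response': {}})
print(rows)
print(empty)
```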

Now run the above function on each neighborhood to create a new dataframe called toronto_venues.

In [28]:
toronto_venues = getNearbyVenues(names=toronto_neighborhoods['Neighborhood'],
                                   latitudes=toronto_neighborhoods['Latitude'],
                                   longitudes=toronto_neighborhoods['Longitude']
                                  )

In [29]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Etobicoke,43.806686,-79.194353,Wendy’s,43.807448,-79.199056,Fast Food Restaurant
1,Etobicoke,43.784535,-79.160497,Great Shine Window Cleaning,43.783145,-79.157431,Home Service
2,Etobicoke,43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
3,Etobicoke,43.763573,-79.188711,RBC Royal Bank,43.76679,-79.191151,Bank
4,Etobicoke,43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store
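
As a sanity check on the 500 m search radius, the great-circle distance between a neighborhood center and a returned venue can be estimated with the haversine formula. This is a sketch that assumes a spherical Earth of radius 6,371 km; `haversine_m` is an illustrative helper, and the coordinates are those of the first row in the table above.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (spherical approximation)


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two (lat, lng) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Neighborhood center and venue coordinates from the first row above.
distance = haversine_m(43.806686, -79.194353, 43.807448, -79.199056)
print(f'{distance:.0f} m from the neighborhood center')
```

For this row the distance comes out below 500 m, which is consistent with the radius passed to the API.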


Let's check how many venues were returned for each neighborhood.

In [30]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Etobicoke,11,11,11,11,11,11
Not assigned,10,10,10,10,10,10


Let's find out how many unique categories can be curated from all the returned venues.

In [31]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 19 unique categories.


<H2>Analyze Each Neighborhood.</H2>

In [32]:
# one hot encoding
toronto_neighborhoods_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_neighborhoods_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_neighborhoods_onehot.columns[-1]] + list(toronto_neighborhoods_onehot.columns[:-1])
toronto_neighborhoods_onehot = toronto_neighborhoods_onehot[fixed_columns]

toronto_neighborhoods_onehot.head()

Unnamed: 0,Neighborhood,Athletics & Sports,Bakery,Bank,Bar,Breakfast Spot,Caribbean Restaurant,Coffee Shop,Donut Shop,Electronics Store,Fast Food Restaurant,Gas Station,Hakka Restaurant,Home Service,Korean BBQ Restaurant,Medical Center,Mexican Restaurant,Rental Car Location,Restaurant,Thai Restaurant
0,Etobicoke,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
1,Etobicoke,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
2,Etobicoke,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Etobicoke,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Etobicoke,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0


<H3>Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category.</H3>

In [33]:
toronto_neighborhoods_grouped = toronto_neighborhoods_onehot.groupby('Neighborhood').mean().reset_index()
toronto_neighborhoods_grouped.head()

Unnamed: 0,Neighborhood,Athletics & Sports,Bakery,Bank,Bar,Breakfast Spot,Caribbean Restaurant,Coffee Shop,Donut Shop,Electronics Store,Fast Food Restaurant,Gas Station,Hakka Restaurant,Home Service,Korean BBQ Restaurant,Medical Center,Mexican Restaurant,Rental Car Location,Restaurant,Thai Restaurant
0,Etobicoke,0.0,0.0,0.090909,0.090909,0.090909,0.0,0.0,0.090909,0.090909,0.090909,0.0,0.0,0.090909,0.0,0.090909,0.090909,0.090909,0.090909,0.0
1,Not assigned,0.1,0.1,0.1,0.0,0.0,0.1,0.2,0.0,0.0,0.0,0.1,0.1,0.0,0.1,0.0,0.0,0.0,0.0,0.1
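
The two steps above (one-hot encoding, then a per-neighborhood mean) turn a list of venues into category frequencies. A minimal sketch on made-up venues shows the arithmetic; the toy names and the `dtype=int` argument are choices made here for illustration.

```python
import pandas as pd

# Toy venues: neighborhood "A" has two cafes and a bank, "B" has one bank.
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Cafe', 'Cafe', 'Bank', 'Bank'],
})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='', dtype=int)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of 0/1 indicators is the share of venues in each category.
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```

Each row of the grouped frame sums to 1, because every venue contributes to exactly one category column.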


<b>Check the 10 most common venues in each neighborhood.</b>

In [34]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond the third use the generic 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_neighborhoods_grouped['Neighborhood']

for ind in np.arange(toronto_neighborhoods_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_neighborhoods_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Etobicoke,Fast Food Restaurant,Home Service,Bank,Bar,Breakfast Spot,Restaurant,Rental Car Location,Donut Shop,Electronics Store,Mexican Restaurant
1,Not assigned,Coffee Shop,Athletics & Sports,Korean BBQ Restaurant,Hakka Restaurant,Gas Station,Bakery,Thai Restaurant,Caribbean Restaurant,Bank,Donut Shop
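
Ranking categories by frequency is a sort followed by a slice. In the grouped table, Etobicoke has many categories tied at the same frequency, so the order among them depends on how the sort breaks ties. A pure-Python sketch with made-up frequencies makes the tie handling explicit; `top_categories` and `ordinal` are hypothetical helpers, not functions from the notebook.

```python
def top_categories(freqs, n):
    """Return the n most frequent category names, ties kept in input order."""
    # sorted() is stable, also with reverse=True, so equal frequencies
    # keep the order in which they appear in `freqs`.
    ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:n]]


def ordinal(k):
    """Format 1 -> '1st', 2 -> '2nd', 3 -> '3rd', 11 -> '11th', 21 -> '21st'."""
    if 10 <= k % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(k % 10, 'th')
    return f'{k}{suffix}'


# Made-up frequencies for one neighborhood.
freqs = {'Bank': 0.1, 'Coffee Shop': 0.2, 'Bakery': 0.1, 'Gas Station': 0.0}
top = top_categories(freqs, 3)
labels = [f'{ordinal(i + 1)} Most Common Venue' for i in range(len(top))]
print(list(zip(labels, top)))
```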


<H2>Cluster Neighborhoods.</H2>

Run k-means to cluster the neighborhoods into 2 clusters.

In [38]:
# set number of clusters
kclusters = 2

toronto_neighborhoods_grouped_clustering = toronto_neighborhoods_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_neighborhoods_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 1], dtype=int32)
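
For intuition about what `KMeans` does with these frequency vectors, here is a minimal pure-Python sketch of Lloyd's algorithm on made-up two-dimensional points. It is not scikit-learn's implementation: it seeds the centers with the first k points (an assumption made to keep the example deterministic), and `lloyd_kmeans` is a hypothetical name. With only two grouped rows and k = 2, as above, each neighborhood ends up in its own cluster, as the label array shows.

```python
def lloyd_kmeans(points, k, iterations=20):
    """Cluster `points` (tuples of floats) into k groups with Lloyd's algorithm."""
    centers = [points[i] for i in range(k)]  # naive seeding: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        labels = [
            min(range(k), key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centers[c])))
            for pt in points
        ]
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return labels, centers


# Made-up frequency vectors: two points near (0, 1) and two near (1, 0).
points = [(0.0, 1.0), (1.0, 0.0), (0.1, 0.9), (0.9, 0.1)]
labels, centers = lloyd_kmeans(points, k=2)
print(labels)
```

A clustering becomes informative only when there are clearly more neighborhoods than clusters, so the full set of Toronto neighborhoods would be a better input than two grouped rows.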