## Capstone Project - Neighbourhood Segmentation and Clustering using Foursquare API.

Before we get the data and start exploring it, let's download all the dependencies that we will need.

In [1]:
pip install lxml  # restart the kernel after installing

Collecting lxml
  Downloading https://files.pythonhosted.org/packages/79/37/d420b7fdc9a550bd29b8cfeacff3b38502d9600b09d7dfae9a69e623b891/lxml-4.5.2-cp36-cp36m-manylinux1_x86_64.whl (5.5MB)
Installing collected packages: lxml
Successfully installed lxml-4.5.2
Note: you may need to restart the kernel to use updated packages.


In [2]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # install geopy if it is not already available
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt

# Seaborn
import seaborn as sns

!pip install geocoder
import geocoder # to get coordinates

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 4.8.3
  latest version: 4.8.4

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geographiclib-1.50         |             py_0          34 KB  conda-forge
    geopy-2.0.0                |     pyh9f0ad1d_0          63 KB  conda-forge
    openssl-1.1.1g             |       h516909a_1         2.1 MB  conda-forge
    ------------------------------------------------------------
                                           Total:         2.2 MB

The following NEW packages will be INSTALLED:

  geographiclib      conda-forge/noarch::geographiclib-1.50-py_0
  geopy              conda-forge/noarch::geo

## 1. Download and Explore Dataset

Importing the table of Toronto postal codes from Wikipedia

In [3]:
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")[0]

Quickly examine the resulting dataframe.

In [4]:
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Cleaning the database

In [5]:
df_clean = df.loc[(df['Borough'] != "Not assigned")]
df_clean.reset_index(drop=True, inplace=True)

In [6]:
df_clean.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [7]:
df_clean.shape

(103, 3)
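The cleaning step can be tried on a small in-memory table. This is a minimal sketch, assuming a toy dataframe `raw` built from the first rows of the preview above; the names `raw` and `clean` are illustrative and not part of the notebook. It takes an explicit `.copy()` of the filtered rows, which is one common way to avoid working on a slice of the original dataframe.

```python
import pandas as pd

# Toy stand-in for the Wikipedia table (values copied from the preview above).
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep assigned boroughs, take an explicit copy, and renumber the rows from 0.
clean = raw[raw["Borough"] != "Not assigned"].copy().reset_index(drop=True)
print(clean)
```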

Defining a function to get coordinates

In [8]:
# Defining a function to get coordinates
def get_latlng(neighborhood):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Toronto, Canada'.format(neighborhood))
        lat_lng_coords = g.latlng
    return lat_lng_coords

# Call the function to get the coordinates, store in a new list using list comprehension
coords = [ get_latlng(neighborhood) for neighborhood in df_clean["Neighbourhood"].tolist()]
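The `while` loop above keeps calling the geocoder until it answers, so a query that never resolves would block the cell indefinitely. The sketch below shows one possible bounded-retry variant. The function name `get_latlng_with_retries`, its parameters, and the `flaky_lookup` stand-in are hypothetical and exist only to make the example runnable offline.

```python
import time

def get_latlng_with_retries(query, lookup, max_attempts=5, pause_seconds=0.0):
    """Call lookup(query) until it returns coordinates or the attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(pause_seconds)  # optional pause between attempts
    return None  # the caller decides how to handle a failed lookup

# Hypothetical stand-in for a geocoder: fails twice, then succeeds.
calls = {"count": 0}

def flaky_lookup(query):
    calls["count"] += 1
    return None if calls["count"] < 3 else [43.73154, -79.31428]

result = get_latlng_with_retries("Victoria Village, Toronto, Canada", flaky_lookup)
never = get_latlng_with_retries("nowhere", lambda q: None, max_attempts=2)
print(result, never)
```

In the notebook, `lookup` could be something like `lambda q: geocoder.arcgis(q).latlng`, which mirrors the call used in `get_latlng`.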

In [9]:
# create temporary dataframe to populate the coordinates into Latitude and Longitude
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])

Joining the coordinates with our database.

In [10]:
# merge the coordinates into the original dataframe
# (work on an explicit copy so pandas does not raise SettingWithCopyWarning)
df_clean = df_clean.copy()
df_clean['Latitude'] = df_coords['Latitude']
df_clean['Longitude'] = df_coords['Longitude']


Checking the database size and some more information

In [11]:
# check the neighborhoods and the coordinates
print(df_clean.shape)
df_clean.head()

(103, 5)


Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.686588,-79.409996
1,M4A,North York,Victoria Village,43.73154,-79.31428
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.659743,-79.361561
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.72357,-79.43711
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.666622,-79.393264


In [12]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(df_clean['Borough'].unique()),
        df_clean.shape[0]
    )
)

The dataframe has 10 boroughs and 103 neighborhoods.


In [50]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Postal Code    103 non-null    object 
 1   Borough        103 non-null    object 
 2   Neighbourhood  103 non-null    object 
 3   Latitude       103 non-null    float64
 4   Longitude      103 non-null    float64
dtypes: float64(2), object(3)
memory usage: 4.1+ KB


Taking the coordinates of the city of Toronto

In [13]:
address = 'Toronto City, ON'

geolocator = Nominatim(user_agent="on_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto City are 43.6534817, -79.3839347.


Creating the Toronto map and plotting all of its neighborhoods

In [14]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(df_clean['Latitude'], df_clean['Longitude'], df_clean['Borough'], df_clean['Neighbourhood']):
    label = '{}'.format(neighbourhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='yellow',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto

## 2. Exploring and Segmenting the Neighborhoods with the Foursquare API

Define Foursquare Credentials and Version

In [15]:
CLIENT_ID = 'DG4GOW5ZH1VOSD5KHSGBSIQZKKRUOMLNRNEJTTEK1KVPPOPD' # your Foursquare ID
CLIENT_SECRET = '5JN3N2HHMVC04ISD2TXGZZ5CY2YBTV4WT3S2VW4BX41C2YX1' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: DG4GOW5ZH1VOSD5KHSGBSIQZKKRUOMLNRNEJTTEK1KVPPOPD
CLIENT_SECRET:5JN3N2HHMVC04ISD2TXGZZ5CY2YBTV4WT3S2VW4BX41C2YX1
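Hard-coding and printing API credentials in a shared notebook exposes them to every reader. A common alternative is to read them from environment variables and to mask them when printing. The sketch below assumes two hypothetical variable names, `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`. The helper functions and the demo values are illustrative only and are not real credentials.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read the client ID and secret from environment variables (names are assumptions)."""
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError("Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first.")
    return client_id, client_secret

def mask(value, visible=4):
    """Show only the first few characters when printing a secret."""
    return value[:visible] + "*" * (len(value) - visible)

# Demo with a plain dict standing in for the real environment.
demo_env = {
    "FOURSQUARE_CLIENT_ID": "EXAMPLEID123",
    "FOURSQUARE_CLIENT_SECRET": "EXAMPLESECRET456",
}
client_id, client_secret = load_foursquare_credentials(demo_env)
print("CLIENT_SECRET: " + mask(client_secret))
```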


As a first test, we build the request URL for the first neighborhood, asking for up to 100 recommended venues within a radius of 5000 meters.

In [16]:
neighborhood_latitude = df_clean.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = df_clean.loc[0, 'Longitude'] # neighborhood longitude value
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 5000 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=DG4GOW5ZH1VOSD5KHSGBSIQZKKRUOMLNRNEJTTEK1KVPPOPD&client_secret=5JN3N2HHMVC04ISD2TXGZZ5CY2YBTV4WT3S2VW4BX41C2YX1&v=20180605&ll=43.6865884896713,-79.40999620161057&radius=5000&limit=100'

Send the GET request and examine the results

In [17]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5f53f970cd732411634ffee1'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Toronto',
  'headerFullLocation': 'Toronto',
  'headerLocationGranularity': 'city',
  'totalResults': 236,
  'suggestedBounds': {'ne': {'lat': 43.731588534671346,
    'lng': -79.34788275514252},
   'sw': {'lat': 43.64158844467126, 'lng': -79.47210964807861}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bef48fcc80dc9284ec827e3',
       'name': 'Casa Loma',
       'location': {'address': '1 Austin Terrace',
        'crossStreet': 'at Walmer Rd',
        'lat': 43.677934,
        'lng': -79.409521,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.677

Collecting the nearby venues for every neighborhood

In [18]:
radius = 5000
LIMIT = 100

venues = []

for bor, neig, lat, long in zip(df_clean['Borough'], df_clean['Neighbourhood'], df_clean['Latitude'], df_clean['Longitude']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            bor,
            neig,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
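The loop above can also be wrapped in a reusable function. The sketch below is one possible shape for it, not the notebook's own code. The name `get_nearby_venues` and the `fetch_items` callable are hypothetical, and `fetch_items(lat, lng)` is assumed to return the list found at `response -> groups[0] -> items`. The offline `fake_fetch` stand-in includes an invented venue with no category to show why indexing `categories[0]` directly can fail.

```python
def get_nearby_venues(rows, fetch_items):
    """Collect one tuple per venue for every (borough, neighborhood, lat, lng) row."""
    venues = []
    for borough, neighborhood, lat, lng in rows:
        for item in fetch_items(lat, lng):
            venue = item["venue"]
            categories = venue.get("categories") or []
            # Guard against venues without a category instead of raising IndexError.
            category = categories[0]["name"] if categories else None
            venues.append((
                borough, neighborhood, lat, lng,
                venue["name"],
                venue["location"]["lat"],
                venue["location"]["lng"],
                category,
            ))
    return venues

# Offline stand-in for the API call; the second venue is invented for illustration.
def fake_fetch(lat, lng):
    return [
        {"venue": {"name": "Casa Loma",
                   "location": {"lat": 43.677934, "lng": -79.409521},
                   "categories": [{"name": "Castle"}]}},
        {"venue": {"name": "Unnamed spot",
                   "location": {"lat": 43.68, "lng": -79.41},
                   "categories": []}},
    ]

rows = [("North York", "Parkwoods", 43.686588, -79.409996)]
venues = get_nearby_venues(rows, fake_fetch)
print(venues)
```

Passing the fetcher in as an argument keeps the parsing logic testable without network access. In the notebook, the fetcher would wrap the `requests.get(url).json()` call.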

Now convert the collected venues into a new dataframe called venues_df.

In [19]:
# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['Borough', 'Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude', 'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']

print(venues_df.shape)
venues_df.head()

(10119, 8)


Unnamed: 0,Borough,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,North York,Parkwoods,43.686588,-79.409996,Casa Loma,43.677934,-79.409521,Castle
1,North York,Parkwoods,43.686588,-79.409996,Scaramouche,43.681293,-79.399492,French Restaurant
2,North York,Parkwoods,43.686588,-79.409996,LCBO,43.681497,-79.391261,Liquor Store
3,North York,Parkwoods,43.686588,-79.409996,Cedarvale Park,43.692535,-79.428705,Field
4,North York,Parkwoods,43.686588,-79.409996,Pukka Restaurant,43.681055,-79.429187,Indian Restaurant


Let's check how many venues were returned for each neighborhood

In [20]:
venues_df.groupby(["Neighborhood"]).count().head()

Neighborhood,Borough,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Agincourt,100,100,100,100,100,100,100
"Alderwood, Long Branch",100,100,100,100,100,100,100
"Bathurst Manor, Wilson Heights, Downsview North",100,100,100,100,100,100,100
Bayview Village,100,100,100,100,100,100,100
"Bedford Park, Lawrence Manor East",100,100,100,100,100,100,100


Some more information about our database

In [21]:
print('There are {} unique categories.'.format(len(venues_df['Venue Category'].unique())))

There are 247 unique categories.


Analyze Each Neighborhood

In [22]:
# one hot encoding
toronto_onehot = pd.get_dummies(venues_df[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhoods'] = venues_df['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

print(toronto_onehot.shape)
toronto_onehot.head()

(10119, 248)


Unnamed: 0,Neighborhoods,Afghan Restaurant,Airport,Airport Lounge,American Restaurant,Antique Shop,Aquarium,Arcade,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Australian Restaurant,Auto Dealership,BBQ Joint,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Basketball Court,Basketball Stadium,Beach,Beer Bar,Beer Store,Bike Shop,Bistro,Bookstore,Botanical Garden,Boutique,Bowling Alley,Brazilian Restaurant,Breakfast Spot,Brewery,Bridal Shop,Bridge,Bubble Tea Shop,Burger Joint,Burrito Place,Butcher,Café,Cajun / Creole Restaurant,Campground,Candy Store,Cantonese Restaurant,Caribbean Restaurant,Castle,Caucasian Restaurant,Chinese Restaurant,Chocolate Shop,Circus,Climbing Gym,Clothing Store,Cocktail Bar,Coffee Shop,College Rec Center,Comedy Club,Comfort Food Restaurant,Comic Shop,Concert Hall,Convenience Store,Cosmetics Shop,Creperie,Cricket Ground,Cuban Restaurant,Cupcake Shop,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Eastern European Restaurant,Egyptian Restaurant,Electronics Store,Escape Room,Ethiopian Restaurant,Event Space,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Field,Filipino Restaurant,Fish & Chips Shop,Fish Market,Flea Market,Flower Shop,Food,Food & Drink Shop,Food Court,Food Truck,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Furniture / Home Store,Gaming Cafe,Garden,Gas Station,Gastropub,General Entertainment,Golf Course,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Hakka Restaurant,Harbor / Marina,Hardware Store,Health & Beauty Service,Health Food Store,Historic Site,History Museum,Hobby Shop,Hockey Arena,Hookah Bar,Hostel,Hotel,Hotpot Restaurant,Hungarian Restaurant,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Indonesian Restaurant,Italian Restaurant,Japanese Restaurant,Jewelry Store,Juice Bar,Kebab Restaurant,Kitchen Supply Store,Korean Restaurant,Lake,Latin American Restaurant,Lingerie Store,Liquor Store,Lounge,Market,Massage Studio,Mediterranean Restaurant,Men's Store,Mexican Restaurant,Middle Eastern Restaurant,Monument / Landmark,Movie Theater,Museum,Music School,Music Store,Music Venue,Nail Salon,National Park,Neighborhood,New American Restaurant,Nightclub,Noodle House,Optical Shop,Organic Grocery,Other Great Outdoors,Outdoor Supply Store,Paintball Field,Pakistani Restaurant,Paper / Office Supplies Store,Park,Pastry Shop,Performing Arts Venue,Persian Restaurant,Peruvian Restaurant,Pet Store,Pharmacy,Pizza Place,Playground,Plaza,Poke Place,Pool,Pool Hall,Portuguese Restaurant,Poutine Place,Pub,Racecourse,Racetrack,Ramen Restaurant,Record Shop,Recreation Center,Rental Car Location,Restaurant,Rock Climbing Spot,Salad Place,Salon / Barbershop,Sandwich Place,Scenic Lookout,Science Museum,Sculpture Garden,Seafood Restaurant,Shoe Store,Shopping Mall,Shopping Plaza,Skating Rink,Ski Chalet,Smoke Shop,Smoothie Shop,Snack Place,Soccer Field,Soup Place,South American Restaurant,Spa,Spanish Restaurant,Speakeasy,Sporting Goods Shop,Sports Bar,Sports Club,Sri Lankan Restaurant,Steakhouse,Supermarket,Sushi Restaurant,Szechuan Restaurant,Taco Place,Taiwanese Restaurant,Tapas Restaurant,Tattoo Parlor,Tea Room,Tennis Stadium,Thai Restaurant,Theater,Theme Restaurant,Thrift / Vintage Store,Tibetan Restaurant,Toy / Game Store,Track,Trail,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Zoo,Zoo Exhibit
0,Parkwoods,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Parkwoods,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Parkwoods,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Next, let's group the rows by neighborhood and sum the occurrences of each category

In [23]:
toronto_grouped = toronto_onehot.groupby(["Neighborhoods"]).sum().reset_index()
print(toronto_grouped.shape)
toronto_grouped.head()

(99, 248)


Unnamed: 0,Neighborhoods,Afghan Restaurant,Airport,Airport Lounge,American Restaurant,Antique Shop,Aquarium,Arcade,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Australian Restaurant,Auto Dealership,BBQ Joint,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Basketball Court,Basketball Stadium,Beach,Beer Bar,Beer Store,Bike Shop,Bistro,Bookstore,Botanical Garden,Boutique,Bowling Alley,Brazilian Restaurant,Breakfast Spot,Brewery,Bridal Shop,Bridge,Bubble Tea Shop,Burger Joint,Burrito Place,Butcher,Café,Cajun / Creole Restaurant,Campground,Candy Store,Cantonese Restaurant,Caribbean Restaurant,Castle,Caucasian Restaurant,Chinese Restaurant,Chocolate Shop,Circus,Climbing Gym,Clothing Store,Cocktail Bar,Coffee Shop,College Rec Center,Comedy Club,Comfort Food Restaurant,Comic Shop,Concert Hall,Convenience Store,Cosmetics Shop,Creperie,Cricket Ground,Cuban Restaurant,Cupcake Shop,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Eastern European Restaurant,Egyptian Restaurant,Electronics Store,Escape Room,Ethiopian Restaurant,Event Space,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Field,Filipino Restaurant,Fish & Chips Shop,Fish Market,Flea Market,Flower Shop,Food,Food & Drink Shop,Food Court,Food Truck,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Furniture / Home Store,Gaming Cafe,Garden,Gas Station,Gastropub,General Entertainment,Golf Course,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Hakka Restaurant,Harbor / Marina,Hardware Store,Health & Beauty Service,Health Food Store,Historic Site,History Museum,Hobby Shop,Hockey Arena,Hookah Bar,Hostel,Hotel,Hotpot Restaurant,Hungarian Restaurant,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Indonesian Restaurant,Italian Restaurant,Japanese Restaurant,Jewelry Store,Juice Bar,Kebab Restaurant,Kitchen Supply Store,Korean Restaurant,Lake,Latin American Restaurant,Lingerie Store,Liquor Store,Lounge,Market,Massage Studio,Mediterranean Restaurant,Men's Store,Mexican Restaurant,Middle Eastern Restaurant,Monument / Landmark,Movie Theater,Museum,Music School,Music Store,Music Venue,Nail Salon,National Park,Neighborhood,New American Restaurant,Nightclub,Noodle House,Optical Shop,Organic Grocery,Other Great Outdoors,Outdoor Supply Store,Paintball Field,Pakistani Restaurant,Paper / Office Supplies Store,Park,Pastry Shop,Performing Arts Venue,Persian Restaurant,Peruvian Restaurant,Pet Store,Pharmacy,Pizza Place,Playground,Plaza,Poke Place,Pool,Pool Hall,Portuguese Restaurant,Poutine Place,Pub,Racecourse,Racetrack,Ramen Restaurant,Record Shop,Recreation Center,Rental Car Location,Restaurant,Rock Climbing Spot,Salad Place,Salon / Barbershop,Sandwich Place,Scenic Lookout,Science Museum,Sculpture Garden,Seafood Restaurant,Shoe Store,Shopping Mall,Shopping Plaza,Skating Rink,Ski Chalet,Smoke Shop,Smoothie Shop,Snack Place,Soccer Field,Soup Place,South American Restaurant,Spa,Spanish Restaurant,Speakeasy,Sporting Goods Shop,Sports Bar,Sports Club,Sri Lankan Restaurant,Steakhouse,Supermarket,Sushi Restaurant,Szechuan Restaurant,Taco Place,Taiwanese Restaurant,Tapas Restaurant,Tattoo Parlor,Tea Room,Tennis Stadium,Thai Restaurant,Theater,Theme Restaurant,Thrift / Vintage Store,Tibetan Restaurant,Toy / Game Store,Track,Trail,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Zoo,Zoo Exhibit
0,Agincourt,0,0,0,1,0,0,0,0,1,2,0,0,0,1,0,5,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,2,0,0,0,2,4,2,0,0,0,0,0,2,6,0,0,6,1,0,0,1,0,6,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,2,0,0,2,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,2,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,3,0,0,1,2,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,3,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,1,0,0,0,0,1,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,3,3,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,1,0,2,0,0,0,0,0,0,1,0,0,0
1,"Alderwood, Long Branch",0,0,0,0,0,0,0,0,1,0,0,0,0,2,0,4,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,3,2,0,0,0,3,3,0,3,0,0,0,0,0,0,0,0,0,0,1,0,1,5,0,0,1,0,0,0,1,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,1,0,0,0,0,2,0,0,0,0,1,0,1,1,3,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,2,0,0,2,1,0,0,1,0,0,0,0,0,2,0,0,0,0,0,1,2,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,1,3,0,0,0,0,0,0,0,1,0,0,0,0,0,0,2,0,0,0,1,0,0,0,3,0,1,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,3,0,0,0,0,0,1,0,1,1,0,0,0,1,0,0,0,0,0,1,0,2,0,0,0,1,0,0,2,0,0
2,"Bathurst Manor, Wilson Heights, Downsview North",0,1,0,2,0,0,0,0,1,0,1,0,0,1,0,2,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,5,0,0,0,0,0,0,0,0,0,0,1,7,0,12,0,0,0,0,0,0,1,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,1,1,0,1,0,1,0,1,0,0,1,0,5,2,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,3,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,1,2,0,1,0,1,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,2,1,2,0,0,0,0,0,0,0,1,1,0,0,0,2,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0
3,Bayview Village,0,0,0,1,0,0,0,0,1,0,0,0,1,0,1,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,3,0,0,0,2,0,0,0,0,4,0,0,3,0,0,0,1,0,3,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,3,0,0,0,0,1,0,1,1,3,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,3,0,0,0,0,6,0,0,0,2,0,0,0,1,0,0,5,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,2,0,2,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,2,4,4,1,0,0,0,0,0,0,3,1,0,0,0,2,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,"Bedford Park, Lawrence Manor East",0,0,0,2,0,0,0,0,1,2,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,1,0,0,0,0,4,1,1,0,0,0,0,0,4,0,0,3,0,0,0,2,0,4,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,2,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,3,3,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,5,0,0,2,1,0,0,0,0,1,0,0,0,2,0,0,0,1,0,1,5,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,2,0,0,0,0,2,2,1,0,0,0,0,0,1,0,2,0,0,0,0,0,0,2,0,0,0,2,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,3,2,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,1,1,0,0,0,0,0,1,0,0,0
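The one-hot encoding and the group-and-sum step are easier to follow on a tiny table. The sketch below uses an invented four-row dataframe `toy` with placeholder neighborhoods "A" and "B". It mirrors the two steps above and then checks them against `pd.crosstab`, which should produce the same counts in a single call.

```python
import pandas as pd

# Invented venues: neighborhood "A" has two coffee shops and a park, "B" one coffee shop.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Park", "Coffee Shop"],
})

# Step 1: one column per category, one row per venue.
onehot = pd.get_dummies(toy[["Venue Category"]], prefix="", prefix_sep="")
onehot.insert(0, "Neighborhoods", toy["Neighborhood"])

# Step 2: sum the indicator columns per neighborhood to get venue counts.
grouped = onehot.groupby("Neighborhoods").sum().reset_index()
print(grouped)

# Cross-check: crosstab counts the same (neighborhood, category) pairs directly.
counts = pd.crosstab(toy["Neighborhood"], toy["Venue Category"])
print(counts)
```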


Of the 99 neighborhoods, 98 have at least one coffee shop among their returned venues. We therefore want to find locations with few coffee shops, where a new coffee shop would face less competition

In [24]:
len((toronto_grouped[toronto_grouped["Coffee Shop"] > 0]))

98
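Note that this expression counts rows (neighborhoods) where the "Coffee Shop" column is positive. It does not count the coffee shops themselves. The small sketch below, using an invented three-row table, shows the difference between the two quantities.

```python
import pandas as pd

# Invented counts for three placeholder neighborhoods.
toy_grouped = pd.DataFrame({
    "Neighborhoods": ["A", "B", "C"],
    "Coffee Shop": [5, 0, 12],
})

# Number of neighborhoods with at least one coffee shop among their venues.
neighborhoods_with_coffee = int((toy_grouped["Coffee Shop"] > 0).sum())

# Total number of coffee shop venues summed over all neighborhoods.
total_coffee_venues = int(toy_grouped["Coffee Shop"].sum())

print(neighborhoods_with_coffee, total_coffee_venues)
```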

The 10 neighborhoods with the most coffee shops

In [25]:
toronto_coffee_shop = toronto_grouped[["Neighborhoods","Coffee Shop"]]

In [26]:
toronto_coffee_shop.sort_values(['Coffee Shop'], ascending=[False]).head(10)

Unnamed: 0,Neighborhoods,Coffee Shop
24,Downsview,20
42,"Islington Avenue, Humber Valley Village",14
57,"Northwest, West Humber - Clairville",12
2,"Bathurst Manor, Wilson Heights, Downsview North",12
64,"Regent Park, Harbourfront",12
74,"St. James Town, Cabbagetown",12
73,St. James Town,11
71,Scarborough Village,11
17,"Cliffside, Cliffcrest, Scarborough Village West",11
34,"Guildwood, Morningside, West Hill",11


The 10 neighborhoods with the fewest coffee shops

In [27]:
toronto_coffee_shop.sort_values(['Coffee Shop'], ascending=[True]).head(10)

Unnamed: 0,Neighborhoods,Coffee Shop
77,Studio District,0
30,"Forest Hill North & West, Forest Hill Road Park",1
28,"Fairview, Henry Farm, Oriole",1
98,"York Mills, Silver Hills",2
75,"Steeles West, L'Amoreaux West",2
37,Hillcrest Village,2
91,"Wexford, Maryvale",3
87,Victoria Village,3
85,"University of Toronto, Harbord",3
60,"Parkdale, Roncesvalles",3


## 3. Clustering the dataset using K-means (3 Clusters)

In [28]:
# set number of clusters
kclusters = 3

toronto_clustering = toronto_coffee_shop.drop(columns=["Neighborhoods"])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([1, 0, 2, 0, 0, 2, 1, 1, 1, 0], dtype=int32)
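K-means assigns label numbers arbitrarily, so label 0 is not guaranteed to be the cluster with the fewest coffee shops. One way to make the labels easier to interpret is to renumber them by ascending cluster center. The sketch below does this on invented counts. The variable names and the explicit `n_init=10` are choices made for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented coffee shop counts, sorted for readability, as a single-feature matrix.
counts = np.array([0, 1, 2, 3, 6, 7, 8, 11, 12, 20], dtype=float).reshape(-1, 1)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(counts)

# Rank the centers from lowest to highest and translate each raw label to its rank.
order = np.argsort(kmeans.cluster_centers_.ravel())  # order[i] = raw label with i-th smallest center
rank_of_label = np.empty_like(order)
rank_of_label[order] = np.arange(len(order))
ordered_labels = rank_of_label[kmeans.labels_]

print(ordered_labels)  # 0 = fewest coffee shops, 2 = most
```

With this renumbering, "cluster 0" means low counts and "cluster 2" means high counts, whatever numbering the algorithm happened to choose.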

We will create a new dataframe that includes all the clusters.

In [29]:
# create a new dataframe that holds the coffee shop count and the cluster label for each neighborhood
toronto_merged = toronto_coffee_shop.copy()

# add clustering labels
toronto_merged["Cluster Labels"] = kmeans.labels_

In [30]:
print(toronto_merged.shape)
toronto_merged.head()

(99, 3)


Unnamed: 0,Neighborhoods,Coffee Shop,Cluster Labels
0,Agincourt,6,1
1,"Alderwood, Long Branch",5,0
2,"Bathurst Manor, Wilson Heights, Downsview North",12,2
3,Bayview Village,3,0
4,"Bedford Park, Lawrence Manor East",4,0


We will include more information in our new dataframe to plot a map with the result of the K-Means algorithm.

In [31]:
# join toronto_merged with df_clean to add postal code, borough, latitude and longitude for each neighborhood
toronto_merged = toronto_merged.join(df_clean.set_index('Neighbourhood'), on='Neighborhoods')

In [32]:
print(toronto_merged.shape)
toronto_merged.head()

(103, 7)


Unnamed: 0,Neighborhoods,Coffee Shop,Cluster Labels,Postal Code,Borough,Latitude,Longitude
0,Agincourt,6,1,M1S,Scarborough,43.78626,-79.28084
1,"Alderwood, Long Branch",5,0,M8W,Etobicoke,43.59354,-79.53275
2,"Bathurst Manor, Wilson Heights, Downsview North",12,2,M3H,North York,43.73737,-79.43417
3,Bayview Village,3,0,M2K,North York,43.7771,-79.37957
4,"Bedford Park, Lawrence Manor East",4,0,M5M,North York,43.751459,-79.265483
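The shape grows from 99 to 103 rows because a few neighborhood names appear under more than one postal code, so the join repeats the matching row once per postal code (Downsview shows up several times in the cluster tables). The sketch below reproduces the effect on a two-row example and shows one possible way to keep a single row per neighborhood. Whether to deduplicate is a judgment call that the notebook itself does not make.

```python
import pandas as pd

left = pd.DataFrame({
    "Neighborhoods": ["Agincourt", "Downsview"],
    "Coffee Shop": [6, 20],
})
right = pd.DataFrame({
    "Postal Code": ["M1S", "M3K", "M3L"],
    "Neighbourhood": ["Agincourt", "Downsview", "Downsview"],
})

# The right-hand key is not unique, so "Downsview" is emitted twice.
joined = left.join(right.set_index("Neighbourhood"), on="Neighborhoods")
print(joined)

# Optional: keep only the first postal code for each neighborhood.
deduplicated = joined.drop_duplicates(subset="Neighborhoods").reset_index(drop=True)
print(deduplicated)
```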


Creating a map to better visualize our dataframe data.

In [33]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhoods'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## 4. Analyzing the 3 clusters resulting from our algorithm

In [34]:
print(len(toronto_merged.loc[toronto_merged['Cluster Labels'] == 0]))# 37 neighbourhoods/places in this cluster 0
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0]

37


Unnamed: 0,Neighborhoods,Coffee Shop,Cluster Labels,Postal Code,Borough,Latitude,Longitude
1,"Alderwood, Long Branch",5,0,M8W,Etobicoke,43.59354,-79.53275
3,Bayview Village,3,0,M2K,North York,43.7771,-79.37957
4,"Bedford Park, Lawrence Manor East",4,0,M5M,North York,43.751459,-79.265483
9,"CN Tower, King and Spadina, Railway Lands, Har...",3,0,M5V,Downtown Toronto,43.64544,-79.39514
14,Christie,4,0,M6G,Downtown Toronto,43.673059,-79.422094
16,"Clarks Corners, Tam O'Shanter, Sullivan",4,0,M1T,Scarborough,43.78643,-79.30156
23,"Dorset Park, Wexford Heights, Scarborough Town...",5,0,M1P,Scarborough,43.73704,-79.27694
25,"Dufferin, Dovercourt Village",5,0,M6H,West Toronto,43.666422,-79.438141
27,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",5,0,M9C,Etobicoke,43.63349,-79.57074
28,"Fairview, Henry Farm, Oriole",1,0,M2J,North York,43.77229,-79.34086


In [35]:
print(len(toronto_merged.loc[toronto_merged['Cluster Labels'] == 1]))# 48 neighbourhoods/places in this cluster 1
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1]

48


Unnamed: 0,Neighborhoods,Coffee Shop,Cluster Labels,Postal Code,Borough,Latitude,Longitude
0,Agincourt,6,1,M1S,Scarborough,43.78626,-79.28084
6,"Birch Cliff, Cliffside West",6,1,M1N,Scarborough,43.69472,-79.2646
7,"Brockton, Parkdale Village, Exhibition Place",6,1,M6K,West Toronto,43.64869,-79.38544
8,"Business reply mail Processing Centre, South C...",6,1,M7Y,East Toronto,43.64869,-79.38544
10,Caledonia-Fairbanks,8,1,M6E,York,43.68857,-79.45483
11,Canada Post Gateway Processing Centre,6,1,M7R,Mississauga,43.64869,-79.38544
12,Cedarbrae,8,1,M1H,Scarborough,43.747741,-79.235178
13,Central Bay Street,7,1,M5G,Downtown Toronto,43.665283,-79.387556
18,"Commerce Court, Victoria Hotel",8,1,M5L,Downtown Toronto,43.64879,-79.379515
19,Davisville,6,1,M4S,Central Toronto,43.70175,-79.38352


In [36]:
print(len(toronto_merged.loc[toronto_merged['Cluster Labels'] == 2]))# 18 neighbourhoods/places in this cluster 2
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2]

18


Unnamed: 0,Neighborhoods,Coffee Shop,Cluster Labels,Postal Code,Borough,Latitude,Longitude
2,"Bathurst Manor, Wilson Heights, Downsview North",12,2,M3H,North York,43.73737,-79.43417
5,Berczy Park,10,2,M5E,Downtown Toronto,43.64811,-79.37517
15,Church and Wellesley,10,2,M4Y,Downtown Toronto,43.6657,-79.38093
17,"Cliffside, Cliffcrest, Scarborough Village West",11,2,M1M,Scarborough,43.73865,-79.21699
24,Downsview,20,2,M3K,North York,43.720197,-79.499895
24,Downsview,20,2,M3L,North York,43.720197,-79.499895
24,Downsview,20,2,M3M,North York,43.720197,-79.499895
24,Downsview,20,2,M3N,North York,43.720197,-79.499895
31,"Garden District, Ryerson",10,2,M5B,Downtown Toronto,43.65794,-79.37562
34,"Guildwood, Morningside, West Hill",11,2,M1E,Scarborough,43.766033,-79.185389


## 5. Results

Cluster 0 contains 37 rows of the merged dataframe and groups the neighborhoods with the fewest coffee shops.
Cluster 1 contains 48 rows, which makes it the largest cluster, and groups the neighborhoods with an intermediate number of coffee shops.
Cluster 2 contains 18 rows, which makes it the smallest cluster, and groups the neighborhoods with the highest concentration of coffee shops.

The results of the K-means cluster show that we can categorize the neighborhoods into 3 clusters based on the frequency of occurrence for “Coffee Shop”:

• Cluster 0: Neighborhoods with the fewest coffee shops.

• Cluster 1: Neighborhoods with a moderate concentration of coffee shops.

• Cluster 2: Neighborhoods with a high concentration of coffee shops.

We see the results of the grouping on the map with cluster 0 in red, cluster 1 in purple and cluster 2 in green.

## 6. Conclusion

A good number of coffee shops are concentrated in downtown Toronto. Cluster 0 neighborhoods have very few coffee shops, which suggests an opportunity for new shops because competition there is low. Cluster 2 coffee shops, by contrast, probably face intense competition because of the high concentration of shops in the same segment. Based on these points alone, we suggest that anyone interested in opening a coffee shop in Toronto look first at the cluster 0 neighborhoods.

We believe that competition in cluster 1 neighborhoods is moderate: there is still room for new shops, but the risk is higher. Finally, entrepreneurs should be cautious about cluster 2 neighborhoods, which already have a high concentration of coffee shops and intense competition. (That said, even where competition is high, a shop with excellent service, a good product, and a clear differentiator can still succeed.)

The rationale of this project can be applied to many other areas or problems, and what was presented is just one example of what this tool can do. In this project we considered only one factor, the frequency of occurrence of coffee shops. Other factors, such as population and residents' income, can also influence the decision of where to locate a new coffee shop.