# Finding the best neighborhood to open a coffee shop in Toronto

### Problem
Opening a coffee shop requires a lot of money, effort and time. That is why finding a good location is paramount to turning a profit and growing the business. However, __businesses are often hit or miss when it comes to choosing the right neighborhood for their coffee shop.__

This capstone project attempts to **find the right place to open a new coffee shop in Toronto**, using Foursquare API location data and census population data to __identify neighborhoods with large populations but few existing coffee shops, so that the user can open a new shop with many potential customers and little competition.__

The intended audience is business owners and coffee shop owners who are looking to open a new store in Toronto.


### Data
To produce this report, two datasets are needed:
<br> 1/ Population data for each neighborhood. This data is gathered from public census data (Statistics Canada, 2016 census).
<br> 2/ The locations of existing coffee shops in Toronto, matched to each neighborhood. This data is taken from the Foursquare API database.
<br> **How we will use the data:**
We first use the neighborhood population data to draw a choropleth map of Toronto's population. We then retrieve the existing coffee shops in Toronto and display them as markers on that choropleth map. **The right neighborhood for a new coffee shop is the one with the highest population and the fewest existing coffee shops.**

In [58]:
!pip install geopy

import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas 1.0+)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [59]:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]

In [60]:
# Drop all rows whose Borough is 'Not assigned', then renumber the rows with reset_index()
df.drop(df.loc[df['Borough']=='Not assigned'].index, inplace=True)
df = df.reset_index(drop=True)

In [61]:
# Open the csv file for coordinates
df_coord = pd.read_csv(r'C:\Users\THAI SON\Desktop\Python Things\Geospatial_Coordinates.csv')

In [62]:
# Merge the coordinates data frame with the neighborhood data frame to create the final list
df3 = df.merge(df_coord, how = 'left', on = 'Postal Code')
df3.head(11)

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
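
A left merge silently leaves `NaN` coordinates for any postal code that is missing from the coordinates file. The sketch below shows one way this could be checked with the `indicator` and `validate` options of `DataFrame.merge`. It uses three rows copied from the table above plus `M0Z`, a made-up postal code that is not in the source and is there only to show what a failed match looks like.

```python
import pandas as pd

# Three rows copied from the table above, plus 'M0Z', a made-up postal code
# with no coordinates, added only to illustrate an unmatched row.
neighborhoods = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A', 'M0Z'],
    'Neighborhood': ['Parkwoods', 'Victoria Village',
                     'Regent Park, Harbourfront', 'Hypothetical'],
})
coordinates = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A'],
    'Latitude': [43.753259, 43.725882, 43.654260],
    'Longitude': [-79.329656, -79.315572, -79.360636],
})

# indicator=True adds a '_merge' column saying where each row came from;
# validate='m:1' raises if a postal code appears twice in `coordinates`.
merged = neighborhoods.merge(coordinates, how='left', on='Postal Code',
                             indicator=True, validate='m:1')

# Postal codes that found no match in the coordinates table.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would suggest that every neighborhood received coordinates.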


From the map we can see that most of the data is concentrated around the docks and harbour area. The Downtown Toronto borough has the most neighborhoods.

In [63]:
url1 = 'https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/hlt-fst/pd-pl/Table.cfm?Lang=Eng&T=1201&SR=1&S=22&O=A&RPP=9999&PR=0'
df_census = pd.read_html(url1)[0].dropna()
df_census

Unnamed: 0,Geographic name,"Population, 2016","Total private dwellings, 2016","Private dwellings occupied by usual residents, 2016"
1,CanadaFootnote 1,35151728.0,15412443.0,14072079.0
2,A0A,46587.0,26155.0,19426.0
3,A0B,19792.0,13658.0,8792.0
4,A0C,12587.0,8010.0,5606.0
5,A0E,22294.0,12293.0,9603.0
6,A0G,35266.0,21750.0,15200.0
7,A0H,17804.0,9928.0,7651.0
8,A0J,7880.0,4813.0,3426.0
9,A0K,26058.0,15159.0,11090.0
10,A0L,7643.0,3769.0,3178.0


In [64]:
df_census.rename(columns = {'Geographic name':'Postal Code'}, inplace = True)

In [65]:
# Merge the population data with the table containing the latitude and longitude of each neighborhood, joining on the Postal Code column.
df_census1 = df3.merge(df_census, how = 'left', on = 'Postal Code')
df_census1 = df_census1.dropna()
df_census1.rename(columns = {'Population, 2016':'Population', 'Private dwellings occupied by usual residents, 2016':'Households'}, inplace = True)

In [66]:
df_census1

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Population,"Total private dwellings, 2016",Households
0,M3A,North York,Parkwoods,43.753259,-79.329656,34615.0,13847.0,13241.0
1,M4A,North York,Victoria Village,43.725882,-79.315572,14443.0,6299.0,6170.0
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,41078.0,24186.0,22333.0
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,21048.0,8751.0,8074.0
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,10.0,6.0,5.0
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242,35594.0,15730.0,15119.0
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,66108.0,20957.0,20230.0
7,M3B,North York,Don Mills,43.745906,-79.352188,13324.0,5193.0,5001.0
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937,18628.0,7872.0,7599.0
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,12785.0,8249.0,7058.0
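
Some postal codes in the table above, such as M7A, belong to institutions and have almost no residents, so they can distort a population-based comparison. The sketch below shows how such rows could be set aside. The `MIN_POPULATION` cut-off is a hypothetical value chosen for illustration and is not part of the original analysis.

```python
import pandas as pd

# Rows copied from the table above.
census = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A', 'M7A'],
    'Population': [34615.0, 14443.0, 41078.0, 10.0],
})

MIN_POPULATION = 1000  # hypothetical cut-off, not taken from the source

# Keep only postal codes with at least MIN_POPULATION residents.
residential = census[census['Population'] >= MIN_POPULATION].reset_index(drop=True)
print(residential['Postal Code'].tolist())
```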


## Getting venue data for each neighborhood from the Foursquare API

Now we will use the **Foursquare API** to get the data we need on the venues in each neighborhood of Toronto.

In [67]:
import os

# Read the Foursquare credentials from environment variables instead of hard-coding them.
# The variable names below are an assumption; use whatever names you have configured.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100

print('Credentials loaded.')

Credentials loaded.


In [68]:
# Define a function that retrieves the venues near each neighborhood from the Foursquare API
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue',
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
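
The function above indexes straight into the response, so a venue without a category or a response without groups would raise an exception. The sketch below shows one way the parsing step could be made more defensive. `extract_venues` is a hypothetical helper, not part of the notebook. It assumes the same nesting the function above reads, and the sample payload is hand-written for illustration rather than real API output.

```python
def extract_venues(payload, name, lat, lng):
    """Return one tuple per venue from an 'explore'-style response dict.

    Assumes the nesting read by getNearbyVenues:
    payload['response']['groups'][0]['items'][i]['venue'].
    """
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []

    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories', [])
        # Fall back to None instead of raising when a venue has no category.
        category = categories[0].get('name') if categories else None
        rows.append((name, lat, lng, venue.get('name'),
                     location.get('lat'), location.get('lng'), category))
    return rows


# Hand-written payload for illustration only; not real API output.
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Tim Hortons',
               'location': {'lat': 43.725517, 'lng': -79.313103},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Venue Without Category',
               'location': {'lat': 43.7, 'lng': -79.3},
               'categories': []}},
]}]}}

rows = extract_venues(sample, 'Victoria Village', 43.725882, -79.315572)
print(rows)
```

The tuples have the same seven fields, in the same order, as the columns assigned to `nearby_venues`.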

In [69]:
# Apply the getNearbyVenues function to each neighborhood in Toronto
toronto_venues = getNearbyVenues(names=df_census1['Neighborhood'],
                                   latitudes=df_census1['Latitude'],
                                   longitudes=df_census1['Longitude']
                                  )

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

In [70]:
# Check the table we have generated
print(toronto_venues.shape)
toronto_venues.head()

(2117, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Parkwoods,43.753259,-79.329656,Corrosion Service Company Limited,43.752432,-79.334661,Construction & Landscaping
3,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant


In [107]:
# Filter on 'Venue Category' to keep only the coffee shops
df_coffeeshop = toronto_venues[toronto_venues['Venue Category']=='Coffee Shop']
df_coffeeshop.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
5,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
9,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
20,"Regent Park, Harbourfront",43.65426,-79.360636,Sumach Espresso,43.658135,-79.359515,Coffee Shop
21,"Regent Park, Harbourfront",43.65426,-79.360636,Arvo,43.649963,-79.361442,Coffee Shop
23,"Regent Park, Harbourfront",43.65426,-79.360636,Rooster Coffee,43.6519,-79.365609,Coffee Shop


## Creating a population map of Toronto to find the most populous areas

In [100]:
# Path to the GeoJSON file with the geography of Toronto
world_geo = r'toronto.geojson'
print('Done')

Done


In [101]:
# Create the map, centered on the latitude and longitude of Toronto
latitude = 43.6534817
longitude = -79.3839347
toronto_map = folium.Map(location=[latitude, longitude], zoom_start=11)

In [102]:
# Create a NumPy array of 5 linearly spaced values between the minimum and maximum population, to use as the bin edges of the map
threshold_scale = np.linspace(df_census1['Population'].min(),
                              df_census1['Population'].max(),
                              5, dtype=int)
threshold_scale = threshold_scale.tolist() # change the numpy array to a list
threshold_scale[-1] = threshold_scale[-1] + 1 # make sure that the last value of the list is greater than the maximum number
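
To make the threshold logic concrete, the sketch below wraps the same steps in a small function and applies it to five population values copied from the census table earlier in the notebook. `make_threshold_scale` is a hypothetical name used only for this illustration.

```python
import numpy as np

def make_threshold_scale(values, n_edges=5):
    """Evenly spaced integer bin edges that cover every value in `values`."""
    edges = np.linspace(min(values), max(values), n_edges, dtype=int).tolist()
    edges[-1] += 1  # push the top edge just past the maximum value
    return edges

# Populations of the first five neighborhoods with distinct values
# in the census table (Parkwoods, Victoria Village, Regent Park,
# Queen's Park, Malvern).
populations = [34615, 14443, 41078, 10, 66108]
scale = make_threshold_scale(populations)
print(scale)
```

The first edge equals the smallest population and the last edge is one more than the largest, so every value falls inside the scale.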

In [103]:
# Draw a choropleth map of the population in Toronto.
toronto_map.choropleth(
    geo_data=world_geo,
    data=df_census1,
    columns=['Neighborhood', 'Population'],
    key_on = 'feature.properties.HOOD',
    fill_color='YlOrRd',
    threshold_scale = threshold_scale,
    fill_opacity = 0.8,
    line_opacity=0.2,
    legend_name='Population & Coffeeshop locations in Toronto Neighborhood'
)

In [104]:
# add markers to map
for lat, lng, neighborhood in zip(df_coffeeshop['Venue Latitude'], df_coffeeshop['Venue Longitude'], df_coffeeshop['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(toronto_map) 

In [105]:
toronto_map

In [147]:
# Count the coffee shops per neighborhood, then merge the counts with the census data on the Neighborhood column
df_grouped = df_coffeeshop.groupby('Neighborhood').count().sort_values('Venue Category', ascending = False).reset_index()
df_combined1 = df_grouped.merge(df_census1, how = 'left', on = 'Neighborhood')
df_combined1

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Postal Code,Borough,Latitude,Longitude,Population,"Total private dwellings, 2016",Households
0,"Harbourfront East, Union Station, Toronto Islands",13,13,13,13,13,13,M5J,Downtown Toronto,43.640816,-79.381752,14545.0,9913.0,8649.0
1,"Commerce Court, Victoria Hotel",12,12,12,12,12,12,M5L,Downtown Toronto,43.648198,-79.379817,0.0,1.0,1.0
2,Central Bay Street,11,11,11,11,11,11,M5G,Downtown Toronto,43.657952,-79.387383,8423.0,5876.0,4929.0
3,Stn A PO Boxes,10,10,10,10,10,10,M5W,Downtown Toronto,43.646435,-79.374846,15.0,11.0,9.0
4,"Toronto Dominion Centre, Design Exchange",10,10,10,10,10,10,M5K,Downtown Toronto,43.647177,-79.381576,0.0,1.0,1.0
5,"First Canadian Place, Underground city",10,10,10,10,10,10,M5X,Downtown Toronto,43.648429,-79.38228,10.0,5.0,3.0
6,"Richmond, Adelaide, King",9,9,9,9,9,9,M5H,Downtown Toronto,43.650571,-79.384568,2005.0,1718.0,1243.0
7,"Queen's Park, Ontario Provincial Government",8,8,8,8,8,8,M7A,Downtown Toronto,43.662301,-79.389494,10.0,6.0,5.0
8,"Garden District, Ryerson",8,8,8,8,8,8,M5B,Downtown Toronto,43.657162,-79.378937,12785.0,8249.0,7058.0
9,"Regent Park, Harbourfront",7,7,7,7,7,7,M5A,Downtown Toronto,43.65426,-79.360636,41078.0,24186.0,22333.0
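
Because the grouping starts from `df_coffeeshop`, a neighborhood with no coffee shop in the Foursquare results never appears in the merged table, even though such a neighborhood is a natural candidate. The sketch below shows one way to keep those rows by starting the merge from the census side. It uses small hand-made stand-ins for `df_census1` and `df_coffeeshop`: the populations are copied from the tables above, but the venue rows are abbreviated, so the counts here do not match the real ones.

```python
import pandas as pd

# Stand-in for df_census1 (populations copied from the tables above).
census = pd.DataFrame({
    'Neighborhood': ['Parkwoods', 'Victoria Village', 'Regent Park, Harbourfront'],
    'Population': [34615.0, 14443.0, 41078.0],
})
# Stand-in for df_coffeeshop (abbreviated; Parkwoods has no coffee shop rows).
coffee = pd.DataFrame({
    'Neighborhood': ['Victoria Village', 'Regent Park, Harbourfront',
                     'Regent Park, Harbourfront'],
    'Venue': ['Tim Hortons', 'Tandem Coffee', 'Sumach Espresso'],
})

# One row per neighborhood with its number of coffee shops.
counts = (coffee.groupby('Neighborhood').size()
                .rename('Number of Coffee Shop').reset_index())

# Start from the census table so every neighborhood is kept,
# then treat a missing count as zero coffee shops.
combined = census.merge(counts, how='left', on='Neighborhood')
combined['Number of Coffee Shop'] = (
    combined['Number of Coffee Shop'].fillna(0).astype(int))
print(combined)
```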


In [157]:
# Drop unnecessary columns and sort by population (largest first), keeping the number of coffee shops in each neighborhood
df_combined2 = df_combined1.loc[0:,['Neighborhood','Venue Category','Population']].sort_values('Population',ascending = False).reset_index(drop=True)
df_combined2.rename({'Venue Category':'Number of Coffee Shop'}, axis = 1, inplace = True)
df_combined2.head(10)

Unnamed: 0,Neighborhood,Number of Coffee Shop,Population
0,"Willowdale, Willowdale East",2,75897.0
1,"Fairview, Henry Farm, Oriole",5,58293.0
2,"Steeles West, L'Amoreaux West",1,48471.0
3,"Kennedy Park, Ionview, East Birchmount Park",1,48434.0
4,"Dufferin, Dovercourt Village",1,44950.0
5,"Regent Park, Harbourfront",7,41078.0
6,"Brockton, Parkdale Village, Exhibition Place",2,40957.0
7,"Willowdale, Willowdale West",1,40792.0
8,Don Mills,2,39153.0
9,"Eringate, Bloordale Gardens, Old Burnhamthorpe...",1,38291.0
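
The stated goal is a neighborhood with a high population and few coffee shops. One possible way to combine the two criteria into a single number is residents per coffee shop. This ratio is an assumption made for illustration, not a metric defined in the original analysis. The sketch below applies it to the ten rows shown in the table above. The coffee shop counts come from a 500 m search radius around a single point per neighborhood, so they are only a rough proxy.

```python
import pandas as pd

# The ten rows shown in the table above, copied as displayed.
top10 = pd.DataFrame({
    'Neighborhood': [
        'Willowdale, Willowdale East',
        'Fairview, Henry Farm, Oriole',
        "Steeles West, L'Amoreaux West",
        'Kennedy Park, Ionview, East Birchmount Park',
        'Dufferin, Dovercourt Village',
        'Regent Park, Harbourfront',
        'Brockton, Parkdale Village, Exhibition Place',
        'Willowdale, Willowdale West',
        'Don Mills',
        'Eringate, Bloordale Gardens, Old Burnhamthorpe...',
    ],
    'Number of Coffee Shop': [2, 5, 1, 1, 1, 7, 2, 1, 2, 1],
    'Population': [75897.0, 58293.0, 48471.0, 48434.0, 44950.0,
                   41078.0, 40957.0, 40792.0, 39153.0, 38291.0],
})

# Residents per existing coffee shop: higher means less competition per resident.
top10['Residents per Coffee Shop'] = (
    top10['Population'] / top10['Number of Coffee Shop'])

ranked = (top10.sort_values('Residents per Coffee Shop', ascending=False)
               .reset_index(drop=True))
print(ranked[['Neighborhood', 'Residents per Coffee Shop']].head(3))
```

Under this ratio, neighborhoods with a single coffee shop and a large population rise to the top, while "Regent Park, Harbourfront", with seven coffee shops, falls to the bottom of these ten.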
