## Introduction: Business Problem

Bourke Street Bakery is a bakery that started in Surry Hills, Sydney, Australia, offering handmade goods, catering services and baking classes. It has since expanded to eleven locations within the Greater Sydney area.

The owners wish to expand their business outside of Sydney to Brisbane, Australia. The purpose of this study is to identify suburbs (i.e. neighbourhoods) in Brisbane that would be suitable for opening the first bakery there.

The suburbs where bakeries exist in Sydney will be examined, and similar suburbs in Brisbane will then be identified as potential opening locations. The primary criteria will be a similar distribution of establishments (e.g. restaurants, cafes and other businesses) to the suburbs where bakeries are already established, together with the distance from the centre of the CBD (central business district).


## Data

To solve the business problem, the suburbs where bakeries already exist will need to be examined for any trends that link them together. When examining each suburb, the following will be considered:

* Combined number of cafes and bakeries in each area
* Number and type of establishment in each area
* Proximity to the centre of the city

The following data sources will be used:

* The **geocoder** package in Python for examining geographical data of each suburb
* The **Foursquare API** for information on number of restaurants and other facilities in each suburb
* The **geopy** package for finding the distance between two locations

### Initial Locations

From [Bourke St Bakery's website](https://bourkestreetbakery.com.au/bakery-locations/ "Bourke St Bakery Locations") we know that bakeries already exist in the following suburbs: Alexandria, Balmain, Banksmeadow, Barangaroo, Kirrawee, Marrickville, Neutral Bay, Newtown, North Sydney, Parramatta, Surry Hills and Potts Point. We can use **Folium** and **geocoder** to get a general idea of where each bakery is located and use that information to make any assumptions or changes in scope.

In [2]:
import numpy as np
import pandas as pd
import json
import geocoder #geographical data
with open('apikey.txt', 'r') as f:
    apikey = f.readline().strip() # stores google apikey (strip the trailing newline)
from geopy.geocoders import Nominatim # convert address to latitude and longitude
import requests # handle url requests
from pandas import json_normalize # transform JSON into DataFrame (top-level import since pandas 1.0)
import folium # map rendering

First, create a list of all the suburbs with existing bakeries.

In [3]:
bakery_existing = ['Alexandria', 'Balmain', 'Banksmeadow', 'Barangaroo', 
                   'Kirrawee', 'Marrickville', 'Neutral Bay', 'Newtown', 'North Sydney', 'Parramatta',
                   'Surry Hills', 'Potts Point']

Iterate through the suburbs to find the latitude and longitude of each one. For reference, Sydney's latitude and longitude are 33.8688°S, 151.2093°E, i.e. (-33.8688, 151.2093).

In [4]:
lat_exist = []
long_exist = []
#loop through suburbs
for i in bakery_existing:
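    coords = None
    for attempt in range(5): # repeat a failed call, but give up after 5 attempts
        g = geocoder.google('{}, Sydney, Australia'.format(i), key=apikey)
        coords = g.latlng
        if coords is not None:
            break
    if coords is None:
        raise RuntimeError('Could not geocode {}'.format(i))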
    lat_exist.append(coords[0])
    long_exist.append(coords[1])
    
# create a dataframe
df_existing = pd.DataFrame({'Suburb':bakery_existing, 'Latitude':lat_exist, 'Longitude':long_exist})
print(df_existing.shape)
df_existing.head(10)

(12, 3)


|   | Suburb | Latitude | Longitude |
|---|---|---|---|
| 0 | Alexandria | -33.908027 | 151.190258 |
| 1 | Balmain | -33.85895 | 151.17906 |
| 2 | Banksmeadow | -33.95731 | 151.20699 |
| 3 | Barangaroo | -33.863794 | 151.20223 |
| 4 | Kirrawee | -34.03573 | 151.070795 |
| 5 | Marrickville | -33.908667 | 151.152414 |
| 6 | Neutral Bay | -33.833938 | 151.218846 |
| 7 | Newtown | -33.897815 | 151.1785 |
| 8 | North Sydney | -33.83965 | 151.20541 |
| 9 | Parramatta | -33.813557 | 151.003407 |


We now have the coordinates of the suburbs that have existing bakeries. We can visualise them with **Folium**.

In [20]:
lat_lng_syd = [-33.8688, 151.2093]
# create map of Sydney
map_sydney = folium.Map(location = lat_lng_syd, zoom_start=10)
#marker for sydney
folium.CircleMarker(
    lat_lng_syd,
    radius = 2,
    color='red',
    fill=True,
    fill_color='red').add_to(map_sydney)
#add markers for suburbs with bakeries
for lat, lng, suburb in zip(df_existing['Latitude'], df_existing['Longitude'], df_existing['Suburb']):
    label='{}'.format(suburb)
    #label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        tooltip=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_sydney)
# draw a circle with a 10km radius (20km diameter) around the centre of sydney
folium.Circle(color='red', radius=10000, location = lat_lng_syd).add_to(map_sydney)
map_sydney

From the above map we can get an idea of where the bakeries (blue markers) are located relative to the centre of Sydney (red marker). Most are located fairly centrally, with the exception of Parramatta (west) and Kirrawee (south). If we only consider bakeries within 10km of the city centre (the red circle), Parramatta and Kirrawee are excluded, leaving ten of the twelve starting suburbs. Excluding them may also reduce the influence of other factors, such as Parramatta being almost a separate city, when analysing the suburbs.
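The 10km cut-off can also be checked numerically rather than read off the map. The sketch below is only an illustration: it uses a plain haversine (great-circle) formula on a spherical Earth instead of **geopy**, the function name `haversine_km` is our own, and the coordinates are copied from the geocoder output above.

```python
from math import asin, cos, radians, sin, sqrt

SYDNEY_CBD = (-33.8688, 151.2093)  # reference point quoted earlier in the notebook


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lng) pairs given in degrees."""
    lat1, lng1, lat2, lng2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))  # 6371 km: mean Earth radius (spherical approximation)


# a few suburbs, with coordinates copied from the table above
suburb_coords = {
    'Banksmeadow': (-33.95731, 151.20699),
    'Kirrawee': (-34.03573, 151.070795),
    'Neutral Bay': (-33.833938, 151.218846),
    'Parramatta': (-33.813557, 151.003407),
}

# distance of each suburb from the CBD, then keep those inside the 10km circle
distances = {name: haversine_km(SYDNEY_CBD, latlng) for name, latlng in suburb_coords.items()}
within_10km = sorted(name for name, d in distances.items() if d <= 10)
print(within_10km)
```

Because the Earth is treated as a sphere, the distances may differ slightly from an ellipsoidal calculation, which matters most for suburbs sitting close to the 10km boundary.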

In [8]:
#resulting dataframe
df_existing=df_existing[(df_existing['Suburb'] != 'Parramatta') & (df_existing['Suburb'] != 'Kirrawee')]
df_existing.reset_index(inplace=True, drop=True)
df_existing

|   | Suburb | Latitude | Longitude |
|---|---|---|---|
| 0 | Alexandria | -33.908027 | 151.190258 |
| 1 | Balmain | -33.85895 | 151.17906 |
| 2 | Banksmeadow | -33.95731 | 151.20699 |
| 3 | Barangaroo | -33.863794 | 151.20223 |
| 4 | Marrickville | -33.908667 | 151.152414 |
| 5 | Neutral Bay | -33.833938 | 151.218846 |
| 6 | Newtown | -33.897815 | 151.1785 |
| 7 | North Sydney | -33.83965 | 151.20541 |
| 8 | Surry Hills | -33.886111 | 151.211111 |
| 9 | Potts Point | -33.86795 | 151.22411 |


### Foursquare data for the remaining suburbs
For this initial submission, **Foursquare API** data will be examined to get a general idea of what facilities exist within each suburb.

In [9]:
# get foursquare credentials
with open('foursquare_credentials.txt') as f:
    l = f.readline()
    fs_cred = json.loads(l)
client_id = fs_cred['CLIENT_ID']
client_secret = fs_cred['CLIENT_SECRET']
access_token = fs_cred['ACCESS_TOKEN']
version = fs_cred['VERSION']

For initial estimates, we can use a default radius of 1000m when searching within each suburb. Although suburbs vary in size and shape, this is good enough for a first estimate.

In [10]:
radius = '1000' # radius of search area
limit = '100' # limit number of returns
# for the suburb of Alexandria
lat = df_existing['Latitude'].iloc[0]
lng = df_existing['Longitude'].iloc[0]
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&oauth_token={}&ll={},{}&v={}&radius={}&limit={}'.format(
    client_id, client_secret, access_token, str(lat), str(lng), version, radius, limit)
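
As an alternative to formatting the URL by hand, the query string can be assembled from a dictionary, which keeps each parameter visible and handles the escaping. This is a minimal sketch: the helper name `explore_url` and the placeholder credentials and version string are assumptions for illustration, while the parameter names are the ones used in the cell above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def explore_url(lat, lng, client_id, client_secret, access_token, version,
                radius=1000, limit=100):
    """Build a venues/explore URL from keyword values (hypothetical helper)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'oauth_token': access_token,
        'll': '{},{}'.format(lat, lng),  # "lat,lng" with no trailing separator
        'v': version,
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)


# placeholder credentials and version, for illustration only
url = explore_url(-33.908027, 151.190258, 'ID', 'SECRET', 'TOKEN', '20180605')

# decode the query string again to confirm what would be sent
query = parse_qs(urlsplit(url).query)
print(query['ll'], query['radius'], query['limit'])
```

The same dictionary could instead be passed to `requests.get` through its `params` argument, which performs the encoding itself.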

In [14]:
# iterate through the rows of the DataFrame
venue_list = []
for index, row in df_existing.iterrows():
    print(row['Suburb'])
    lat = row['Latitude']
    lng = row['Longitude']
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&oauth_token={}&ll={},{}&v={}&radius={}&limit={}'.format(
        client_id, client_secret, access_token, str(lat), str(lng), version, radius, limit)
    results = requests.get(url).json()
    
    items = results['response']['groups'][0]['items']
    
    # iterate through all the items
    venue_list.append([(row['Suburb'], item['venue']['name'], item['venue']['categories'][0]['name']) for item in items])


    

Alexandria
Balmain
Banksmeadow
Barangaroo
Marrickville
Neutral Bay
Newtown
North Sydney
Surry Hills
Potts Point


In [15]:
df_venues = pd.DataFrame([item for items in venue_list for item in items], columns = ['Suburb', 'Venue_Name', 'Venue_Type'])
print(df_venues.shape)
df_venues.head()

(839, 3)


|   | Suburb | Venue_Name | Venue_Type |
|---|---|---|---|
| 0 | Alexandria | Pino’s Vino e Cucina | Italian Restaurant |
| 1 | Alexandria | The Grounds of Alexandria | Café |
| 2 | Alexandria | The Potting Shed at The Grounds | Bar |
| 3 | Alexandria | Bunnings Warehouse | Hardware Store |
| 4 | Alexandria | La Cachette | Café |


In total we have 839 venues across all the suburbs that already have a bakery. We can now examine the venue types to see whether the suburbs have anything in common. For now we will just look at the five most common types of venue in each suburb, as well as the number of venues. That way we can hopefully get an idea of the types of venues in the suburb and also how busy the suburb typically is (assuming more venues means a busier suburb).

First, take a quick look at the number of venues returned for each suburb before encoding the dataframe.

In [16]:
#a quick look at the data
df_venues.groupby('Suburb').count()

| Suburb | Venue_Name | Venue_Type |
|---|---|---|
| Alexandria | 82 | 82 |
| Balmain | 86 | 86 |
| Banksmeadow | 15 | 15 |
| Barangaroo | 100 | 100 |
| Marrickville | 75 | 75 |
| Neutral Bay | 81 | 81 |
| Newtown | 100 | 100 |
| North Sydney | 100 | 100 |
| Potts Point | 100 | 100 |
| Surry Hills | 100 | 100 |


**Note:** Each Foursquare search returns at most 100 results (the `limit` set above). Searches could be repeated to collect further unique hits, but for now we will assume that 100 results are representative of the suburb.

In [17]:
print('There are {} unique venue types across the examined suburbs'.format(len(df_venues['Venue_Type'].unique())))

There are 171 unique venue types across the examined suburbs


In [18]:
#apply one hot encoding to the dataframe
df_venues_one_hot = pd.get_dummies(df_venues[['Venue_Type']], prefix="", prefix_sep="")
print(df_venues_one_hot.shape)
df_venues_one_hot.head(5)

(839, 171)


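|   | American Restaurant | Antique Shop | Arcade | Argentinian Restaurant | Art Gallery | Art Museum | Arts & Crafts Store | Asian Restaurant | Australian Restaurant | BBQ Joint | ... | Train Station | Tree | Turkish Restaurant | Vegetarian / Vegan Restaurant | Vietnamese Restaurant | Waterfront | Whisky Bar | Wine Bar | Wine Shop | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |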


Look at the five most common venue types in each suburb.

In [19]:
# keep only the Suburb column alongside the one hot columns so that sum() only sees numeric data
df_venues_ = pd.concat([df_venues[['Suburb']], df_venues_one_hot], axis=1)
df_venues_sum=df_venues_.groupby('Suburb').sum().reset_index()
for suburb in df_venues_sum['Suburb']:
    print('Suburb: {}'.format(suburb))
    temp=df_venues_sum[df_venues_sum['Suburb']==suburb].T.reset_index()
    temp=temp.iloc[1:]
    temp.columns=['Venue','Number']
    print(temp.sort_values('Number', ascending=False).reset_index(drop=True).head(5))
    print('\n')

Suburb: Alexandria
                   Venue Number
0                   Café     22
1     Italian Restaurant      4
2  Vietnamese Restaurant      3
3                    Pub      3
4   Gym / Fitness Center      3


Suburb: Balmain
              Venue Number
0              Café     21
1               Pub      9
2               Bar      5
3  Sushi Restaurant      4
4            Bakery      3


Suburb: Banksmeadow
                 Venue Number
0                 Café      3
1                 Port      1
2       Clothing Store      1
3                Beach      1
4  Rental Car Location      1


Suburb: Barangaroo
         Venue Number
0         Café      9
1          Bar      7
2  Coffee Shop      7
3    Speakeasy      5
4        Hotel      4


Suburb: Marrickville
                   Venue Number
0                   Café     18
1  Vietnamese Restaurant     13
2                 Bakery      5
3          Deli / Bodega      3
4        Thai Restaurant      3


Suburb: Neutral Bay
                 

#### Results

From the results above, we can see that cafes are the most common type of venue across all suburbs. The next most common types of venue are restaurants and pubs/bars. This suggests that the suburbs share a similar mix of venues, which is worth analysing further.

The next step will be to repeat this process for all suburbs in the region of interest and use the data to fit a clustering model.

#### Next Steps

At this point, only the types of establishment in each suburb have been found. The next step will be to take the mean of each one hot encoded column, i.e. the share of a suburb's venues that belong to each type, by using **mean()** instead of **sum()** when grouping the suburbs after one hot encoding. The combined figure for cafes and bakeries will either be included as an additional column or merged into the final dataframe. The **distance** function from **geopy** can be used to calculate the distance from the centre of each suburb to the centre of its city.
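
To make the **mean()** step concrete: the mean of a 0/1 one hot column within a suburb is simply the fraction of that suburb's venues with that type. The sketch below reproduces that calculation in plain Python on made-up rows; the suburb names, the venue rows, the helper `venue_shares` and the way cafes and bakeries are combined (adding their shares) are all assumptions for illustration, not results from the notebook.

```python
from collections import Counter, defaultdict

# hypothetical (suburb, venue type) pairs, in the same shape as the rows of df_venues
rows = [
    ('Suburb A', 'Café'), ('Suburb A', 'Café'), ('Suburb A', 'Pub'), ('Suburb A', 'Bakery'),
    ('Suburb B', 'Café'), ('Suburb B', 'Bar'),
]


def venue_shares(rows):
    """Return {suburb: {venue type: share of that suburb's venues}}."""
    counts = defaultdict(Counter)
    for suburb, venue_type in rows:
        counts[suburb][venue_type] += 1
    shares = {}
    for suburb, counter in counts.items():
        total = sum(counter.values())
        shares[suburb] = {venue_type: n / total for venue_type, n in counter.items()}
    return shares


shares = venue_shares(rows)

# one possible combined feature: share of venues that are either a cafe or a bakery
cafe_bakery = {suburb: s.get('Café', 0) + s.get('Bakery', 0) for suburb, s in shares.items()}
print(shares)
print(cafe_bakery)
```

Working with shares rather than raw counts keeps suburbs comparable even when Foursquare returns very different numbers of venues for them.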

Once the dataframe has been established, k-means clustering will be applied to all the suburbs within a 10km radius of the centre of Sydney, with the suburbs that have bakeries hopefully falling into the same cluster.
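
For reference, the idea behind k-means can be sketched in a few lines. In practice a library implementation such as scikit-learn's `KMeans` would probably be used; the version below is a bare-bones Lloyd's algorithm in plain Python, and the feature vectors are invented two-dimensional examples (e.g. a cafe share and a restaurant share), not data from this study.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Minimal Lloyd's algorithm; returns (labels, centroids) for a list of tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k of the data points
    labels = None
    for _ in range(iterations):
        # assignment step: each point goes to its nearest centroid (squared Euclidean distance)
        new_labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # update step: move each centroid to the mean of the points assigned to it
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# hypothetical feature vectors: three cafe-heavy suburbs and three restaurant-heavy suburbs
features = [(0.25, 0.10), (0.22, 0.12), (0.27, 0.09),
            (0.05, 0.40), (0.04, 0.45), (0.06, 0.38)]
labels, centroids = kmeans(features, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters for this study is which suburbs end up sharing a label with the suburbs that already have a bakery.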