# Capstone Project (Week 5)
### Opening a New Pub in Richmond Hill, Ontario


## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

The history of pubs can be traced to the Roman taverns established in Britain from 43 AD (“Historic UK”, n.d., para 3), which means pubs have existed for almost 2,000 years. Why have they lasted so long? Tribute Cornish Pale Ale conducted a poll on the reasons for visiting a pub, and the results showed that catching up with friends, the atmosphere and the opportunity to have a few drinks were the top three (“Beertoday”, 2017). For many people, going to a pub is a way to relax and enjoy themselves after work, at weekends or during holidays. The profitability of pubs is also strongly correlated with sports events (“Financial Times”, n.d.): on game nights, for example, large numbers of residents are attracted to nearby pubs. There are hundreds of pubs in Richmond Hill, so the choice of location is crucial to a pub owner’s investment decision.

The main purpose of this project is **to find an optimal location for a new pub in Richmond Hill, Ontario** using data science methodology, specifically machine learning techniques.


## Data <a name="data"></a>

This project needs the following data:

* A list of neighbourhoods in Richmond Hill, Ontario, with their postal codes.
* The latitude and longitude coordinates of each neighbourhood.
* Venue data for each neighbourhood, particularly data related to pubs.

Data sources and processing methods:

   * Richmond Hill is a city in south-central York Region, Ontario, Canada. The city contains four major neighbourhoods: Richmond Hill (Southeast), Richmond Hill (Southwest), Richmond Hill (Oak Ridges / Lake Wilcox / Temperanceville) and Richmond Hill (Central). The Wikipedia page https://en.wikipedia.org/wiki/Richmond_Hill,_Ontario#Communities lists the neighbourhoods in Richmond Hill.

   * Because the number of neighbourhoods is small, we build the dataframe by hand with **pandas**, the Python data analysis library. We then get the geographical coordinates of the neighbourhoods using the **Python Geocoder package** and use the **Foursquare API** to obtain venue data for each neighbourhood. The Foursquare API returns venues in many categories, but we are interested in the pub category, which helps us address the business problem described above. Finally, we clean the dataset, apply the machine learning technique **K-means clustering** and visualize the results on a map using the **Folium package**.


### 1. Import Libraries

In [5]:
import numpy as np
import pandas as pd 
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

In [6]:
import json

In [7]:
from geopy.geocoders import Nominatim

In [9]:
import requests
from pandas.io.json import json_normalize 
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans

In [11]:
!conda install -c conda-forge folium=0.5.0 --yes
import folium


In [14]:
!pip install geocoder
import geocoder

Collecting geocoder
Collecting ratelim (from geocoder)
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


### 2. Form dataframe

In [15]:
df = {'PostalCode': ['L4B', 'L4C', 'L4E', 'L4S'], 'Neighbourhood': ['Richmond hill (Southeast)', 'Richmond hill (Southwest)', 'Richmond hill (Oak Ridges / Lake Wilcox / Temperanceville)', 'Richmond hill (Central)']}

df = pd.DataFrame(data = df)

In [16]:
df

| | PostalCode | Neighbourhood |
|---|---|---|
| 0 | L4B | Richmond hill (Southeast) |
| 1 | L4C | Richmond hill (Southwest) |
| 2 | L4E | Richmond hill (Oak Ridges / Lake Wilcox / Temp... |
| 3 | L4S | Richmond hill (Central) |


In [17]:
df.shape

(4, 2)

### 3. Get the geographical coordinates

In [18]:
def get_latlng(neighborhood, max_tries=10):
    # query the ArcGIS geocoder until it returns coordinates,
    # giving up after max_tries attempts instead of looping forever
    for _ in range(max_tries):
        g = geocoder.arcgis('{}, Richmond hill, Ontario'.format(neighborhood))
        if g.latlng is not None:
            return g.latlng
    raise RuntimeError('No coordinates returned for {}'.format(neighborhood))

In [20]:
coords = [ get_latlng(neighborhood) for neighborhood in df["PostalCode"].tolist() ]

In [21]:
coords

[[43.85865000000007, -79.39261999999997],
 [43.86364000000003, -79.43937999999997],
 [43.94001000000003, -79.43578999999994],
 [43.89609000000007, -79.40531999999996]]

In [22]:
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])

In [23]:
df['Latitude'] = df_coords['Latitude']
df['Longitude'] = df_coords['Longitude']

In [24]:
print(df.shape)
df

(4, 4)


| | PostalCode | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | L4B | Richmond hill (Southeast) | 43.85865 | -79.39262 |
| 1 | L4C | Richmond hill (Southwest) | 43.86364 | -79.43938 |
| 2 | L4E | Richmond hill (Oak Ridges / Lake Wilcox / Temp... | 43.94001 | -79.43579 |
| 3 | L4S | Richmond hill (Central) | 43.89609 | -79.40532 |


In [26]:
df.to_csv("df.csv", index=False)

### 4. Create a map of Richmond Hill

In [27]:
address = 'Richmond hill, Ontario'

geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Richmond hill, Ontario are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Richmond hill, Ontario are 43.880078, -79.439392.


In [28]:
map_rh = folium.Map(location=[latitude, longitude], zoom_start=11)

In [30]:
for lat, lng, neighborhood in zip(df['Latitude'], df['Longitude'], df['PostalCode']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_rh)  
    
map_rh

In [31]:
map_rh.save('map_rh.html')

## Methodology <a name="methodology"></a>

* Step 1: Get the list of neighbourhoods in the city of Richmond Hill. The information is available on the Wikipedia page https://en.wikipedia.org/wiki/Richmond_Hill,_Ontario#Communities.
* Step 2: Build a pandas dataframe to hold the neighbourhood data.
* Step 3: Obtain the latitude and longitude of each neighbourhood using the Geocoder package, add them to the dataframe and visualize the neighbourhoods on a map using the Folium package.
* Step 4: Use the Foursquare API to get up to 100 venues within a radius of 2,000 metres of each neighbourhood. Foursquare returns the venue data in JSON format. After extracting each venue's name, category, latitude and longitude, we check how many venues were returned for each neighbourhood and how many unique categories they cover.
* Step 5: Analyse each neighbourhood by grouping the venues by postal code and taking the mean frequency of occurrence of each venue category, then keep only the pub category.
* Step 6: Perform K-means clustering to group the neighbourhoods by their frequency of occurrence of the pub category.
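
Step 5 can be illustrated on a toy dataset. The sketch below is only an illustration under assumed data: the three venue rows, the postal codes `A1A` and `B2B`, and the `toy_*` names are hypothetical, not values returned by Foursquare. It shows how one-hot encoding followed by a per-postal-code mean yields the frequency of each category.

```python
import pandas as pd

# Hypothetical venue rows (not real Foursquare data), in the same layout as this project
toy_venues = pd.DataFrame({
    "PostalCode": ["A1A", "A1A", "B2B"],
    "VenueCategory": ["Pub", "Bakery", "Bakery"],
})

# One-hot encode the category column: one indicator column per category
toy_onehot = pd.get_dummies(toy_venues["VenueCategory"]).astype(float)
toy_onehot.insert(0, "PostalCode", toy_venues["PostalCode"])

# The mean of the indicators per postal code is each category's share of venues
toy_grouped = toy_onehot.groupby("PostalCode").mean().reset_index()
print(toy_grouped)
```

In this toy case one of the two `A1A` venues is a pub, so its pub frequency is 0.5, while `B2B` has none and gets 0.0.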


## Analysis <a name="analysis"></a>

### 5. Use the Foursquare API to explore the neighbourhoods

In [92]:
CLIENT_ID = 'your Foursquare ID'  # Foursquare client ID
CLIENT_SECRET = 'your Foursquare Secret'  # Foursquare client secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: your Foursquare ID
CLIENT_SECRET: your Foursquare Secret


In [33]:

radius = 2000
LIMIT = 100

venues = []

for lat, long, neighborhood in zip(df['Latitude'], df['Longitude'], df['PostalCode']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
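
The request above builds the query string by hand and indexes straight into the response. As a hedged alternative sketch, the same call can pass its parameters through the `params` argument of `requests.get` and fail early on HTTP errors. The helper names `explore_params`, `extract_venues` and `fetch_venues` are assumptions introduced here for illustration; the endpoint, parameter names and response keys are the ones used in the cell above. Returning `None` for a venue without categories is also a choice made for this sketch.

```python
import requests

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"

def explore_params(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the query parameters used by the explore request above."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }

def extract_venues(payload, postal_code, lat, lng):
    """Flatten one explore response into one tuple per venue."""
    rows = []
    for item in payload["response"]["groups"][0]["items"]:
        venue = item["venue"]
        # Fall back to None when a venue has no category (assumption of this sketch)
        categories = venue.get("categories") or [{"name": None}]
        rows.append((postal_code, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     categories[0]["name"]))
    return rows

def fetch_venues(postal_code, lat, lng, **request_settings):
    """Hypothetical wrapper: one request per neighbourhood, raising on HTTP errors."""
    response = requests.get(
        EXPLORE_URL,
        params=explore_params(lat=lat, lng=lng, **request_settings),
        timeout=30,
    )
    response.raise_for_status()
    return extract_venues(response.json(), postal_code, lat, lng)
```

With these helpers, the loop body would reduce to `venues.extend(fetch_venues(neighborhood, lat, long, client_id=CLIENT_ID, client_secret=CLIENT_SECRET, version=VERSION, radius=radius, limit=LIMIT))`.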

In [34]:
venues_df = pd.DataFrame(venues)

In [41]:
venues_df.columns = ['PostalCode', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

In [42]:
print(venues_df.shape)
venues_df.head()

(253, 7)


| | PostalCode | Latitude | Longitude | VenueName | VenueLatitude | VenueLongitude | VenueCategory |
|---|---|---|---|---|---|---|---|
| 0 | L4B | 43.85865 | -79.39262 | Fresh Burger | 43.861472 | -79.387122 | Burger Joint |
| 1 | L4B | 43.85865 | -79.39262 | Moksha Yoga Richmond Hill | 43.861628 | -79.388474 | Yoga Studio |
| 2 | L4B | 43.85865 | -79.39262 | Holiday Inn Express & Suites Toronto - Markham | 43.849209 | -79.382256 | Hotel |
| 3 | L4B | 43.85865 | -79.39262 | Cafe Bon Bon | 43.845 | -79.384519 | Dessert Shop |
| 4 | L4B | 43.85865 | -79.39262 | Ichiban Fish House | 43.866228 | -79.386254 | Japanese Restaurant |


In [43]:
venues_df.groupby(["PostalCode"]).count()

| PostalCode | Latitude | Longitude | VenueName | VenueLatitude | VenueLongitude | VenueCategory |
|---|---|---|---|---|---|---|
| L4B | 100 | 100 | 100 | 100 | 100 | 100 |
| L4C | 70 | 70 | 70 | 70 | 70 | 70 |
| L4E | 28 | 28 | 28 | 28 | 28 | 28 |
| L4S | 55 | 55 | 55 | 55 | 55 | 55 |
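
Note that L4B returned exactly 100 venues, which equals `LIMIT`, so its list may have been cut off by the request rather than exhausted. A minimal sketch of how such capped neighbourhoods could be flagged is shown below. The counts are copied from the output above, while the `venue_counts` and `capped` names are introduced here for illustration.

```python
import pandas as pd

LIMIT = 100  # same cap as in the Foursquare request

# Venue counts per postal code, copied from the groupby output above
venue_counts = pd.Series({"L4B": 100, "L4C": 70, "L4E": 28, "L4S": 55}, name="VenueCount")

# A count equal to LIMIT suggests the result list was cut off at the cap
capped = venue_counts[venue_counts >= LIMIT].index.tolist()
print(capped)
```

For a capped postal code, the category frequencies computed later describe only the venues that were returned, not necessarily every venue within the radius.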


In [44]:
print('There are {} unique categories.'.format(len(venues_df['VenueCategory'].unique())))

There are 94 unique categories.


In [45]:
venues_df['VenueCategory'].unique()[:50]

array(['Burger Joint', 'Yoga Studio', 'Hotel', 'Dessert Shop',
       'Japanese Restaurant', 'Hong Kong Restaurant', 'Coffee Shop',
       'Dumpling Restaurant', 'Community Center', 'Brazilian Restaurant',
       'Restaurant', 'Gym / Fitness Center', 'BBQ Joint',
       'Chinese Restaurant', 'Lounge', 'Bubble Tea Shop', 'Bakery',
       'Bank', 'Middle Eastern Restaurant', 'New American Restaurant',
       'Fried Chicken Joint', 'Dim Sum Restaurant', 'Ramen Restaurant',
       'Breakfast Spot', 'Shanghai Restaurant', 'Noodle House',
       'Indian Restaurant', 'Cocktail Bar', 'Cantonese Restaurant',
       'Vegetarian / Vegan Restaurant', 'Bowling Alley',
       'Peking Duck Restaurant', 'Asian Restaurant',
       'Vietnamese Restaurant', 'Korean Restaurant', 'Pharmacy',
       'Sandwich Place', 'Gas Station', 'Indian Chinese Restaurant',
       'Paper / Office Supplies Store', 'Fast Food Restaurant',
       'Wings Joint', 'Toy / Game Store', 'Hotpot Restaurant',
       'Shopping Mall'

In [46]:
"Pub" in venues_df['VenueCategory'].unique()

True

### 6. Analyse each neighbourhood

In [49]:
rh_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")
rh_onehot['PostalCode'] = venues_df['PostalCode'] 
fixed_columns = [rh_onehot.columns[-1]] + list(rh_onehot.columns[:-1])
rh_onehot = rh_onehot[fixed_columns]
print(rh_onehot.shape)
rh_onehot.head()

(253, 95)


Unnamed: 0,PostalCode,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bakery,Bank,Beer Store,Bookstore,Bowling Alley,Brazilian Restaurant,Breakfast Spot,Bubble Tea Shop,Burger Joint,Bus Stop,Café,Cantonese Restaurant,Cheese Shop,Chinese Restaurant,Clothing Store,Cocktail Bar,Coffee Shop,Community Center,Cosmetics Shop,Cupcake Shop,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dumpling Restaurant,Fast Food Restaurant,Food & Drink Shop,Food Truck,Fried Chicken Joint,Furniture / Home Store,Gas Station,Grocery Store,Gym,Gym / Fitness Center,Home Service,Hong Kong Restaurant,Hotel,Hotpot Restaurant,Hungarian Restaurant,IT Services,Ice Cream Shop,Indian Chinese Restaurant,Indian Restaurant,Intersection,Italian Restaurant,Japanese Restaurant,Korean Restaurant,Lake,Laser Tag,Liquor Store,Lounge,Mediterranean Restaurant,Middle Eastern Restaurant,Mini Golf,Moving Target,Music Store,New American Restaurant,Noodle House,Paper / Office Supplies Store,Park,Peking Duck Restaurant,Performing Arts Venue,Pet Store,Pharmacy,Pilates Studio,Pizza Place,Pool,Portuguese Restaurant,Pub,Ramen Restaurant,Restaurant,Sandwich Place,Shanghai Restaurant,Shopping Mall,Shopping Plaza,Skating Rink,Soccer Field,Sporting Goods Shop,Supermarket,Sushi Restaurant,Tea Room,Toy / Game Store,Trail,Vape Store,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Warehouse Store,Wings Joint,Yoga Studio
0,L4B,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,L4B,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,L4B,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,L4B,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,L4B,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [50]:
# let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category
rh_grouped = rh_onehot.groupby(["PostalCode"]).mean().reset_index()
print(rh_grouped.shape)
rh_grouped

(4, 95)


Unnamed: 0,PostalCode,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bakery,Bank,Beer Store,Bookstore,Bowling Alley,Brazilian Restaurant,Breakfast Spot,Bubble Tea Shop,Burger Joint,Bus Stop,Café,Cantonese Restaurant,Cheese Shop,Chinese Restaurant,Clothing Store,Cocktail Bar,Coffee Shop,Community Center,Cosmetics Shop,Cupcake Shop,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dumpling Restaurant,Fast Food Restaurant,Food & Drink Shop,Food Truck,Fried Chicken Joint,Furniture / Home Store,Gas Station,Grocery Store,Gym,Gym / Fitness Center,Home Service,Hong Kong Restaurant,Hotel,Hotpot Restaurant,Hungarian Restaurant,IT Services,Ice Cream Shop,Indian Chinese Restaurant,Indian Restaurant,Intersection,Italian Restaurant,Japanese Restaurant,Korean Restaurant,Lake,Laser Tag,Liquor Store,Lounge,Mediterranean Restaurant,Middle Eastern Restaurant,Mini Golf,Moving Target,Music Store,New American Restaurant,Noodle House,Paper / Office Supplies Store,Park,Peking Duck Restaurant,Performing Arts Venue,Pet Store,Pharmacy,Pilates Studio,Pizza Place,Pool,Portuguese Restaurant,Pub,Ramen Restaurant,Restaurant,Sandwich Place,Shanghai Restaurant,Shopping Mall,Shopping Plaza,Skating Rink,Soccer Field,Sporting Goods Shop,Supermarket,Sushi Restaurant,Tea Room,Toy / Game Store,Trail,Vape Store,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Warehouse Store,Wings Joint,Yoga Studio
0,L4B,0.0,0.02,0.0,0.03,0.03,0.04,0.0,0.0,0.01,0.01,0.03,0.03,0.01,0.0,0.02,0.05,0.0,0.06,0.0,0.01,0.04,0.01,0.0,0.0,0.0,0.04,0.03,0.0,0.0,0.02,0.02,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.01,0.0,0.06,0.02,0.01,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.05,0.02,0.0,0.0,0.0,0.01,0.0,0.02,0.0,0.0,0.0,0.02,0.01,0.01,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.02,0.04,0.02,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.01,0.0,0.0,0.01,0.02,0.0,0.01,0.02
1,L4C,0.0,0.0,0.0,0.0,0.014286,0.042857,0.014286,0.014286,0.0,0.0,0.0,0.014286,0.014286,0.0,0.014286,0.0,0.014286,0.014286,0.014286,0.0,0.071429,0.014286,0.0,0.0,0.028571,0.014286,0.0,0.014286,0.014286,0.0,0.014286,0.014286,0.0,0.0,0.014286,0.014286,0.042857,0.014286,0.014286,0.0,0.0,0.0,0.0,0.014286,0.0,0.0,0.0,0.0,0.0,0.014286,0.0,0.0,0.0,0.014286,0.014286,0.0,0.0,0.014286,0.014286,0.0,0.0,0.0,0.0,0.0,0.071429,0.0,0.014286,0.0,0.028571,0.014286,0.057143,0.014286,0.014286,0.014286,0.014286,0.042857,0.014286,0.0,0.014286,0.0,0.014286,0.0,0.028571,0.014286,0.028571,0.0,0.014286,0.0,0.0,0.0,0.028571,0.0,0.014286,0.0
2,L4E,0.0,0.0,0.035714,0.0,0.035714,0.035714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.035714,0.035714,0.0,0.035714,0.0,0.0,0.0,0.0,0.035714,0.071429,0.035714,0.035714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.0,0.071429,0.0,0.035714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.035714,0.0,0.071429,0.0,0.0,0.0,0.0,0.071429,0.035714,0.0,0.0,0.0,0.035714,0.0,0.0,0.035714,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.0,0.0
3,L4S,0.018182,0.0,0.0,0.0,0.0,0.054545,0.0,0.0,0.0,0.0,0.036364,0.0,0.036364,0.018182,0.018182,0.018182,0.0,0.036364,0.0,0.0,0.090909,0.0,0.018182,0.018182,0.0,0.018182,0.0,0.036364,0.018182,0.0,0.0,0.0,0.018182,0.0,0.0,0.0,0.018182,0.018182,0.0,0.018182,0.0,0.0,0.0,0.0,0.018182,0.018182,0.0,0.0,0.018182,0.018182,0.0,0.018182,0.0,0.0,0.018182,0.0,0.018182,0.0,0.0,0.018182,0.018182,0.0,0.0,0.018182,0.054545,0.0,0.0,0.018182,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.036364,0.0,0.0,0.018182,0.036364,0.018182,0.0,0.0,0.072727,0.0,0.0,0.0,0.018182,0.0,0.018182,0.018182,0.0,0.0


In [56]:
rh_pub = rh_grouped[["PostalCode","Pub"]]

In [57]:
rh_pub

| | PostalCode | Pub |
|---|---|---|
| 0 | L4B | 0.0 |
| 1 | L4C | 0.014286 |
| 2 | L4E | 0.0 |
| 3 | L4S | 0.0 |
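
Because each frequency is a share of the venues returned for that postal code, it can be converted back into a raw count by multiplying by the venue count. The sketch below does this for the pub column, using the frequencies and venue counts shown in the outputs above; the variable names are introduced here for illustration.

```python
import pandas as pd

# Pub frequencies and venue counts per postal code, copied from the outputs above
pub_freq = pd.Series({"L4B": 0.0, "L4C": 0.014286, "L4E": 0.0, "L4S": 0.0})
venue_counts = pd.Series({"L4B": 100, "L4C": 70, "L4E": 28, "L4S": 55})

# frequency x number of venues recovers the number of pubs in each sample
pub_counts = (pub_freq * venue_counts).round().astype(int)
print(pub_counts.to_dict())
```

Under these figures, the pub frequency for L4C corresponds to a single pub among the 70 venues returned.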


### 7. Cluster the neighbourhoods

In [64]:
kclusters = 2
rh_clustering = rh_pub.drop(columns=["PostalCode"])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(rh_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 1, 0, 0], dtype=int32)
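
With a single feature and only two distinct values, K-means with two clusters simply separates the zero-frequency postal codes from the non-zero one. The following sketch reproduces that on the four pub frequencies shown earlier. It is a standalone illustration; which group receives label 0 or 1 is arbitrary and may vary with the scikit-learn version, so the sketch only compares labels with each other.

```python
import numpy as np
from sklearn.cluster import KMeans

# Pub frequencies for L4B, L4C, L4E, L4S, copied from the rh_pub output
pub_freq = np.array([[0.0], [0.014286], [0.0], [0.0]])

# n_init is set explicitly because its default differs across scikit-learn versions
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pub_freq)
labels = model.labels_

# Postal codes that share a label with L4B belong to the same cluster as L4B
same_as_l4b = [bool(label == labels[0]) for label in labels]
print(same_as_l4b)
```

Since the split is determined entirely by whether the pub frequency is zero, the clustering here adds little beyond a direct comparison of the `Pub` column.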

In [66]:
rh_merged = rh_pub.copy()

# add clustering labels
rh_merged["Cluster Labels"] = kmeans.labels_

In [67]:
rh_merged

| | PostalCode | Pub | Cluster Labels |
|---|---|---|---|
| 0 | L4B | 0.0 | 0 |
| 1 | L4C | 0.014286 | 1 |
| 2 | L4E | 0.0 | 0 |
| 3 | L4S | 0.0 | 0 |


In [68]:
rh_merged = rh_merged.join(df.set_index("PostalCode"), on="PostalCode")

print(rh_merged.shape)
rh_merged

(4, 6)


| | PostalCode | Pub | Cluster Labels | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 0 | L4B | 0.0 | 0 | Richmond hill (Southeast) | 43.85865 | -79.39262 |
| 1 | L4C | 0.014286 | 1 | Richmond hill (Southwest) | 43.86364 | -79.43938 |
| 2 | L4E | 0.0 | 0 | Richmond hill (Oak Ridges / Lake Wilcox / Temp... | 43.94001 | -79.43579 |
| 3 | L4S | 0.0 | 0 | Richmond hill (Central) | 43.89609 | -79.40532 |


In [70]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(rh_merged['Latitude'], rh_merged['Longitude'], rh_merged['PostalCode'], rh_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [71]:
map_clusters.save('map_clusters.html')
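
The marker colours depend on how `rainbow[cluster-1]` indexes the colour list: cluster 0 maps to index -1, the last colour, and cluster 1 maps to index 0, the first. A small sketch, assuming the same two-cluster setup as above, makes the assignment explicit; `cluster_colours` is a name introduced here for illustration.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 2

# Same colour list as in the map cell: kclusters evenly spaced rainbow colours
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(c) for c in colors_array]

# rainbow[cluster - 1] wraps around for cluster 0 because index -1 is the last element
cluster_colours = {cluster: rainbow[cluster - 1] for cluster in range(kclusters)}
print(cluster_colours)
```

The rainbow colormap runs from purple at its start to red at its end, so cluster 0 is drawn in red and cluster 1 in purple.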

### 8. Examine clusters

In [73]:
rh_merged.loc[rh_merged['Cluster Labels'] == 0]

| | PostalCode | Pub | Cluster Labels | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 0 | L4B | 0.0 | 0 | Richmond hill (Southeast) | 43.85865 | -79.39262 |
| 2 | L4E | 0.0 | 0 | Richmond hill (Oak Ridges / Lake Wilcox / Temp... | 43.94001 | -79.43579 |
| 3 | L4S | 0.0 | 0 | Richmond hill (Central) | 43.89609 | -79.40532 |


In [75]:
rh_merged.loc[rh_merged['Cluster Labels'] == 1]

| | PostalCode | Pub | Cluster Labels | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 1 | L4C | 0.014286 | 1 | Richmond hill (Southwest) | 43.86364 | -79.43938 |


## Results and Discussion <a name="results"></a>

The K-means clustering yields 2 clusters: neighbourhoods without pubs (cluster 0, red on the map) and neighbourhoods with pubs (cluster 1, purple on the map).

In the dataset we obtained, pubs appear only in the southwest of Richmond Hill (L4C), where they account for 1 of the 70 venues returned, a frequency of 0.014286. The reason may be that the downtown of Richmond Hill is in the southwest of the city, along Yonge Street.
Therefore, there is a great opportunity to open new pubs in the cluster that has no pubs, which covers the southeast, central and Oak Ridges / Lake Wilcox / Temperanceville areas of the city.


## Conclusion <a name="conclusion"></a>

Based on our analysis using machine learning techniques, a new pub could be located in the southeast, central or Oak Ridges / Lake Wilcox / Temperanceville areas of Richmond Hill. However, the project has some limitations.
We only have four major neighbourhoods, and each covers a large area of the city, so this is not the best way to find an optimal location for a pub. With enough data sources, it would be better to build the dataframe from smaller communities. Many other factors also need to be considered when choosing a pub location, such as population, income, age, education and occupation, whereas we considered only one: the frequency of occurrence of pubs. Further research should therefore build a model that includes more of these factors to obtain a more accurate result.
