<H1>Capstone Project - The Battle of the Neighborhoods (Week 2)</H1>
<H3>Applied Data Science Capstone by IBM/Coursera</H3>

<H2>1. Introduction: Business Problem</H2>
    
In this project, we try to identify an optimal location for opening a coffee shop in Hong Kong, China. The report is aimed at stakeholders who want to invest in a coffee shop or expand an existing coffee shop business in Hong Kong, a city in the middle of a coffee craze.

Since there are already many coffee shops in Hong Kong, the project tries to detect locations that are not yet crowded with coffee shops. We are particularly interested in areas with no coffee shop in the vicinity. We would also prefer locations as close to the city center as possible, provided that the first two conditions are met.

Using data science methodology and machine-learning clustering, this project suggests an answer to the business question: where is the recommended location to invest in a coffee shop in Hong Kong, China?

In [184]:
from bs4 import BeautifulSoup
import numpy as np
import requests
import pandas as pd

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import geocoder # to get coordinates

from arcgis.gis import GIS
from arcgis.geocoding import geocode, reverse_geocode
from arcgis.geometry import Point

from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans
import folium # map rendering library

print("Library import completed")

Library import completed


<H2>2. Scrape data from the Wikipedia page into a DataFrame</H2>

In [185]:
# send the GET request
url = requests.get("https://en.wikipedia.org/wiki/Districts_of_Hong_Kong").text

In [187]:
# parse data from the html into a beautifulsoup object
soup = BeautifulSoup(url, 'html.parser')

In [188]:
# create a list to store neighborhood data
neighborhoodList = []

In [189]:
# append the data into the list
for row in soup.find_all("table", class_="multicol")[0].findAll("p"):
    neighbourhood = row.text.split("\n")
    
    for district in neighbourhood:
        if len(district) > 0 :
            if district[2] == " ": 
                neighborhoodList.append(district[3:])
            else :
                neighborhoodList.append(district[4:])
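
The index arithmetic in the loop above assumes that every entry starts with a short index token followed by a space, so that the name begins at position 3 or 4. A minimal sketch of the same idea as a standalone helper is shown below. The function name `strip_index_prefix` and the sample strings are hypothetical and are not taken from the Wikipedia page; the helper splits on the first space instead of counting characters.

```python
def strip_index_prefix(entry):
    """Drop a leading index token such as '1.' or '10.' and return the name."""
    # split once on the first space: left part is the index, right part the name
    prefix, sep, name = entry.strip().partition(" ")
    # if there is no space at all, keep the entry unchanged
    return name.strip() if sep else prefix

# hypothetical sample lines, shaped like the entries the loop above expects
samples = ["1. Islands", "10. Kowloon City", ""]
names = [strip_index_prefix(s) for s in samples if len(s) > 0]
print(names)
```

Splitting on the first space avoids an `IndexError` on very short strings, but it still relies on the assumption that every entry carries an index prefix.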
                

In [190]:

# create a new DataFrame from the list
cf_df = pd.DataFrame({"Neighborhood": neighborhoodList})

cf_df.head()

Unnamed: 0,Neighborhood
0,Islands
1,Kwai Tsing
2,North
3,Sai Kung
4,Sha Tin


In [191]:
cf_df.shape

(18, 1)

<H2>3. Get the geographical coordinates</H2>

In [192]:
# define a function to get coordinates
def get_latlng(neighborhood):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Hong Kong, China'.format(neighborhood))
        lat_lng_coords = g.latlng
    return lat_lng_coords
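
The `while` loop above keeps retrying until the geocoder answers, which can hang indefinitely if a lookup never succeeds. A possible alternative, sketched here under assumptions, is to cap the number of attempts. The helper name `get_latlng_with_retry`, its parameters, and the `fake_lookup` stand-in are hypothetical; in the notebook the `lookup` argument could be something like `lambda q: geocoder.arcgis(q).latlng`.

```python
import time

def get_latlng_with_retry(lookup, query, max_attempts=5, wait_seconds=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out."""
    for attempt in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(wait_seconds)  # pause before the next attempt
    return None  # the caller decides how to handle a failed lookup

# stand-in for a geocoding call that fails twice before succeeding
calls = {"count": 0}

def fake_lookup(query):
    calls["count"] += 1
    return None if calls["count"] < 3 else [22.3, 114.2]

result = get_latlng_with_retry(fake_lookup, "Sha Tin, Hong Kong, China")
print(result, calls["count"])
```

Returning `None` after the last attempt makes a failed lookup visible in the resulting DataFrame instead of blocking the notebook.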

In [193]:
# call the function to get the coordinates, store in a new list using list comprehension
coords = [ get_latlng(neighborhood) for neighborhood in cf_df["Neighborhood"].tolist() ]

In [194]:
coords

[[22.257879200000048, 113.97076950000007],
 [22.349590500000033, 114.11746390000008],
 [22.51488050000006, 114.18429820000006],
 [22.398258200000043, 114.32004830000005],
 [22.379549600000075, 114.19856240000001],
 [22.29086826446604, 113.95226341504485],
 [22.370660000000044, 114.1047900000001],
 [22.396709000000044, 113.97562360000006],
 [22.454008000000044, 114.04826070000001],
 [22.31113000000005, 114.18354000000011],
 [22.31544320000006, 114.22606420000011],
 [22.329350805367028, 114.15917854227246],
 [22.33666500000004, 114.19197200000008],
 [22.30973890000007, 114.16852090000009],
 [22.28219000000007, 114.14486000000011],
 [22.272090400000025, 114.22139600000003],
 [22.25801000000007, 114.15308000000005],
 [22.268839700000058, 114.1827181000001]]

In [195]:
# create temporary dataframe to populate the coordinates into Latitude and Longitude
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])

In [196]:
# merge the coordinates into the original dataframe
cf_df['Latitude'] = df_coords['Latitude']
cf_df['Longitude'] = df_coords['Longitude']

In [197]:
# check the neighborhoods and the coordinates
print(cf_df.shape)
cf_df

(18, 3)


Unnamed: 0,Neighborhood,Latitude,Longitude
0,Islands,22.257879,113.97077
1,Kwai Tsing,22.349591,114.117464
2,North,22.514881,114.184298
3,Sai Kung,22.398258,114.320048
4,Sha Tin,22.37955,114.198562
5,Tai Po,22.290868,113.952263
6,Tsuen Wan,22.37066,114.10479
7,Tuen Mun,22.396709,113.975624
8,Yuen Long,22.454008,114.048261
9,Kowloon City,22.31113,114.18354


<H2>4. Create a map of Hong Kong with neighborhoods superimposed on top</H2>

In [198]:
# get the coordinates of Hong Kong
address = 'Hong Kong, China'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Hong Kong, China are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Hong Kong, China are 22.2793278, 114.1628131.
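
The introduction states a preference for locations close to the city center, but the notebook does not quantify that distance. One way to do so, sketched here as an assumption rather than as part of the original analysis, is the haversine great-circle distance between the city coordinates returned above and a district's coordinates. The helper name `haversine_km` is hypothetical; the two coordinate pairs are the values shown in this notebook.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    earth_radius_km = 6371.0  # mean Earth radius
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))

center = (22.2793278, 114.1628131)       # Hong Kong coordinates returned by Nominatim
central_western = (22.28219, 114.14486)  # Central & Western, as geocoded in section 3
distance = haversine_km(*center, *central_western)
print(round(distance, 2))
```

Applied to every row of `cf_df`, such a distance column could serve as a tie-breaker between otherwise similar neighborhoods.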


In [199]:
# create map of Hong Kong using latitude and longitude values
map_hk = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, neighborhood in zip(cf_df['Latitude'], cf_df['Longitude'], cf_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_hk)
    
map_hk

<H2>5. Use the Foursquare API to explore the neighborhoods</H2>

Now, let's get the top 100 venues that are within a radius of 2000 meters.

In [200]:
radius = 2000
LIMIT = 100

venues = []

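# CLIENT_ID, CLIENT_SECRET and VERSION (Foursquare credentials and API version) are assumed to be defined in an earlier cell
for lat, long, neighborhood in zip(cf_df['Latitude'], cf_df['Longitude'], cf_df['Neighborhood']):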
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
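
The inner loop above indexes `categories[0]` directly, which raises an `IndexError` if a venue comes back without any category. A minimal sketch of a more defensive extraction step follows. The helper name `extract_venues`, the `"Unknown"` placeholder, and the sample payload are hypothetical; the payload only mimics the nested keys that the loop above reads.

```python
def extract_venues(items, neighborhood, lat, lng):
    """Flatten Foursquare 'items' into tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # fall back to a placeholder instead of raising IndexError
        category = categories[0]["name"] if categories else "Unknown"
        rows.append((neighborhood, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# hypothetical payload shaped like the 'items' list used in the loop above
sample_items = [
    {"venue": {"name": "Cafe A", "location": {"lat": 22.28, "lng": 114.15},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Venue B", "location": {"lat": 22.29, "lng": 114.16},
               "categories": []}},
]
rows = extract_venues(sample_items, "Central & Western", 22.28219, 114.14486)
print(rows)
```

Keeping the parsing in its own function also makes it testable without calling the API.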

In [201]:
# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df.head()

(1291, 7)


Unnamed: 0,Neighborhood,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,Islands,22.257879,113.97077,Sunset Peak (大東山),22.256208,113.955868,Mountain
1,Islands,22.257879,113.97077,Mau Kee Restaurant 茂記中西餐廳,22.245694,113.978654,Chinese Restaurant
2,Islands,22.257879,113.97077,Lantau Trail (Section 2) (鳳凰徑(第二段)),22.25633,113.986855,Trail
3,Islands,22.257879,113.97077,Nam Shan Camp Site 南山營地,22.253278,113.986194,Campground
4,Islands,22.257879,113.97077,The Water Buffalo British Restaurant & Brewpub...,22.24396,113.97884,Restaurant


Let's check how many venues were returned for each neighborhood.

In [202]:
venues_df.groupby(["Neighborhood"]).count()

Neighborhood,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
Central & Western,100,100,100,100,100,100
Eastern,100,100,100,100,100,100
Islands,6,6,6,6,6,6
Kowloon City,100,100,100,100,100,100
Kwai Tsing,40,40,40,40,40,40
Kwun Tong,100,100,100,100,100,100
North,6,6,6,6,6,6
Sai Kung,8,8,8,8,8,8
Sha Tin,98,98,98,98,98,98
Sham Shui Po,100,100,100,100,100,100


Let's find out how many unique categories can be curated from all the returned venues

In [203]:
print('There are {} unique categories.'.format(len(venues_df['VenueCategory'].unique())))

There are 191 unique categories.


In [204]:
# print out the first 50 categories
venues_df['VenueCategory'].unique()[:50]

array(['Mountain', 'Chinese Restaurant', 'Trail', 'Campground',
       'Restaurant', 'Spanish Restaurant', 'French Restaurant',
       'Fast Food Restaurant', 'Park', 'Tunnel', 'Multiplex',
       'Waterfront', 'Asian Restaurant', 'Shanghai Restaurant',
       'Japanese Restaurant', 'Pizza Place', 'Theater', 'Coffee Shop',
       'Malay Restaurant', 'Shopping Mall', 'Track Stadium',
       'Department Store', 'Café', 'Clothing Store', 'Hotel',
       'Vietnamese Restaurant', 'Athletics & Sports', 'Grocery Store',
       'Convenience Store', 'Sushi Restaurant', 'Ramen Restaurant',
       'Gym / Fitness Center', 'Scenic Lookout', 'Farm', 'Garden',
       'Tourist Information Center', 'Bus Station', 'Snack Place',
       'Harbor / Marina', 'Supermarket', 'Train Station',
       'Dumpling Restaurant', 'Temple', 'Theme Park',
       'Seafood Restaurant', 'Ice Cream Shop', 'Electronics Store',
       'Noodle House', 'Buffet', 'Cantonese Restaurant'], dtype=object)

In [205]:
# check if the results contain "Coffee Shop"
"Coffee Shop" in venues_df['VenueCategory'].unique()

True

<H2>6. Analyze Each Neighborhood</H2>

In [206]:
# one hot encoding
kl_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
kl_onehot['Neighborhoods'] = venues_df['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [kl_onehot.columns[-1]] + list(kl_onehot.columns[:-1])
kl_onehot = kl_onehot[fixed_columns]

print(kl_onehot.shape)
kl_onehot.head()

(1291, 192)


Unnamed: 0,Neighborhoods,Accessories Store,American Restaurant,Antique Shop,Arcade,Art Gallery,Art Museum,Asian Restaurant,Astrologer,Athletics & Sports,...,Vietnamese Restaurant,Waterfall,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Yoga Studio,Yunnan Restaurant,Zhejiang Restaurant,Zoo
0,Islands,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Islands,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Islands,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Islands,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Islands,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Next, let's group the rows by neighborhood and take the mean of the frequency of occurrence of each category.

In [207]:
kl_grouped = kl_onehot.groupby(["Neighborhoods"]).mean().reset_index()

print(kl_grouped.shape)
kl_grouped

(18, 192)


Unnamed: 0,Neighborhoods,Accessories Store,American Restaurant,Antique Shop,Arcade,Art Gallery,Art Museum,Asian Restaurant,Astrologer,Athletics & Sports,...,Vietnamese Restaurant,Waterfall,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Yoga Studio,Yunnan Restaurant,Zhejiang Restaurant,Zoo
0,Central & Western,0.0,0.0,0.01,0.0,0.01,0.0,0.01,0.0,0.0,...,0.01,0.0,0.0,0.0,0.03,0.02,0.03,0.0,0.0,0.0
1,Eastern,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,...,0.02,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0
2,Islands,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Kowloon City,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,...,0.01,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Kwai Tsing,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.025,...,0.025,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Kwun Tong,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,...,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Sai Kung,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Sha Tin,0.0,0.0,0.0,0.0,0.0,0.0,0.020408,0.0,0.0,...,0.010204,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Sham Shui Po,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
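
To see why the mean of the one-hot columns can be read as a frequency, here is a small sketch on toy data. The two neighborhoods `A` and `B` and their venues are hypothetical and are not taken from the Foursquare results.

```python
import pandas as pd

# toy data (hypothetical): two neighborhoods with a handful of venues each
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "VenueCategory": ["Coffee Shop", "Park", "Coffee Shop", "Hotel", "Park", "Hotel"],
})

# one column per category, 1 where the venue belongs to that category
onehot = pd.get_dummies(toy["VenueCategory"], dtype=int)
onehot.insert(0, "Neighborhood", toy["Neighborhood"])

# the mean of a 0/1 column is the share of venues in that category
freq = onehot.groupby("Neighborhood").mean().reset_index()
print(freq)
```

In this toy case, two of the four venues in `A` are coffee shops, so its `Coffee Shop` value is 0.5. The values in the grouped table above are therefore shares of the returned venues, not absolute counts.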


In [208]:
len(kl_grouped[kl_grouped["Coffee Shop"] > 0])

15

Create a new DataFrame with the coffee shop data only

In [209]:
kl_cafe = kl_grouped[["Neighborhoods","Coffee Shop"]]

In [210]:
kl_cafe.head()

Unnamed: 0,Neighborhoods,Coffee Shop
0,Central & Western,0.06
1,Eastern,0.03
2,Islands,0.0
3,Kowloon City,0.05
4,Kwai Tsing,0.05


<H2> 7. Cluster Neighborhoods </H2>

Run k-means to cluster the neighborhoods in Hong Kong into 3 clusters.

In [211]:
# set number of clusters
kclusters = 3

kl_clustering = kl_cafe.drop(columns=["Neighborhoods"])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(kl_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 2, 1, 2, 2, 0, 1, 1, 2, 0])

In [212]:
# create a new dataframe that holds the coffee shop frequency and the cluster label for each neighborhood
kl_merged = kl_cafe.copy()

# add clustering labels
kl_merged["Cluster Labels"] = kmeans.labels_

In [213]:
kl_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
kl_merged.head()

Unnamed: 0,Neighborhood,Coffee Shop,Cluster Labels
0,Central & Western,0.06,0
1,Eastern,0.03,2
2,Islands,0.0,1
3,Kowloon City,0.05,2
4,Kwai Tsing,0.05,2


In [214]:
# merge kl_merged with cf_df to add latitude/longitude for each neighborhood
kl_merged = kl_merged.join(cf_df.set_index("Neighborhood"), on="Neighborhood")

print(kl_merged.shape)
kl_merged.head() # check the last columns!

(18, 5)


Unnamed: 0,Neighborhood,Coffee Shop,Cluster Labels,Latitude,Longitude
0,Central & Western,0.06,0,22.28219,114.14486
1,Eastern,0.03,2,22.27209,114.221396
2,Islands,0.0,1,22.257879,113.97077
3,Kowloon City,0.05,2,22.31113,114.18354
4,Kwai Tsing,0.05,2,22.349591,114.117464


In [215]:
# sort the results by Cluster Labels
print(kl_merged.shape)
kl_merged.sort_values(["Cluster Labels"], inplace=True)
kl_merged

(18, 5)


Unnamed: 0,Neighborhood,Coffee Shop,Cluster Labels,Latitude,Longitude
0,Central & Western,0.06,0,22.28219,114.14486
15,Wong Tai Sin,0.07,0,22.336665,114.191972
13,Tuen Mun,0.061538,0,22.396709,113.975624
5,Kwun Tong,0.09,0,22.315443,114.226064
12,Tsuen Wan,0.060606,0,22.37066,114.10479
9,Sham Shui Po,0.07,0,22.329351,114.159179
11,Tai Po,0.078431,0,22.290868,113.952263
2,Islands,0.0,1,22.257879,113.97077
6,North,0.0,1,22.514881,114.184298
7,Sai Kung,0.0,1,22.398258,114.320048


Finally, let's visualize the resulting clusters

In [170]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(kl_merged['Latitude'], kl_merged['Longitude'], kl_merged['Neighborhood'], kl_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [171]:
# save the map as HTML file
map_clusters.save('map_clusters.html')

<H2> 8. Examine Clusters </H2>

Cluster 0

In [172]:
kl_merged.loc[kl_merged['Cluster Labels'] == 0]

Unnamed: 0,Neighborhood,Coffee Shop,Cluster Labels,Latitude,Longitude
0,Central & Western,0.06,0,22.28219,114.14486
15,Wong Tai Sin,0.07,0,22.336665,114.191972
13,Tuen Mun,0.061538,0,22.396709,113.975624
5,Kwun Tong,0.09,0,22.315443,114.226064
12,Tsuen Wan,0.060606,0,22.37066,114.10479
9,Sham Shui Po,0.07,0,22.329351,114.159179
11,Tai Po,0.078431,0,22.290868,113.952263


Cluster 1

In [173]:
kl_merged.loc[kl_merged['Cluster Labels'] == 1]

Unnamed: 0,Neighborhood,Coffee Shop,Cluster Labels,Latitude,Longitude
2,Islands,0.0,1,22.257879,113.97077
6,North,0.0,1,22.514881,114.184298
7,Sai Kung,0.0,1,22.398258,114.320048


Cluster 2

In [174]:
kl_merged.loc[kl_merged['Cluster Labels'] == 2]

Unnamed: 0,Neighborhood,Coffee Shop,Cluster Labels,Latitude,Longitude
14,Wan Chai,0.05,2,22.26884,114.182718
8,Sha Tin,0.040816,2,22.37955,114.198562
16,Yau Tsim Mong,0.05,2,22.309739,114.168521
4,Kwai Tsing,0.05,2,22.349591,114.117464
3,Kowloon City,0.05,2,22.31113,114.18354
1,Eastern,0.03,2,22.27209,114.221396
10,Southern,0.04,2,22.25801,114.15308
17,Yuen Long,0.055556,2,22.454008,114.048261
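
The k-means labels are arbitrary identifiers, so it helps to rank the clusters by their coffee shop frequency before interpreting them. The sketch below does this with the values copied from the three cluster tables above; the variable names `clusters` and `summary` are hypothetical.

```python
import pandas as pd

# coffee shop frequencies and labels copied from the three cluster tables above
clusters = pd.DataFrame({
    "Cluster Labels": [0] * 7 + [1] * 3 + [2] * 8,
    "Coffee Shop": [0.06, 0.07, 0.061538, 0.09, 0.060606, 0.07, 0.078431,
                    0.0, 0.0, 0.0,
                    0.05, 0.040816, 0.05, 0.05, 0.05, 0.03, 0.04, 0.055556],
})

# size and frequency range of each cluster, ordered from lowest to highest mean
summary = (clusters.groupby("Cluster Labels")["Coffee Shop"]
           .agg(["count", "mean", "min", "max"])
           .sort_values("mean"))
print(summary)
```

Sorted this way, the clusters run from no coffee shops (cluster 1) through a moderate share (cluster 2) to the highest share (cluster 0).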


Observations:

Most coffee shops are concentrated in the central area of Hong Kong. Cluster 0 has the highest share of coffee shops among the returned venues (about 6% to 9%), and cluster 2 has a moderate share (about 3% to 6%). The neighborhoods in cluster 1, by contrast, have no coffee shops among the returned venues. This makes cluster 1 a promising area for opening a new coffee shop, since there is little to no competition from existing shops. This project therefore recommends that entrepreneurs capitalize on these findings and open new coffee shops in the cluster 1 neighborhoods. Investors with unique selling propositions that stand out from the competition could also open new coffee shops in the cluster 2 neighborhoods, where competition is moderate. Lastly, it is advisable to avoid the neighborhoods in cluster 0, which already have a high concentration of coffee shops and face intense competition.