# The Battle of Neighborhoods
### --- What kind of restaurant should I open in Hong Kong?

In [42]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np

# more libraries to be imported ... 

****
## 1) Introduction

This project aims to suggest which kind of restaurant would be the best choice
to open in each of Hong Kong's 18 districts, based on processing and analyzing
data about those districts.

The data used in the project is mainly obtained from Wikipedia and Foursquare.
More details about the data will be discussed in the next section.

To achieve this, the project follows these steps:
1. Cluster the 18 districts into 4 groups based on the similarities between districts, using
details of each district such as the population density, the median monthly family
income, popular venues, etc.
2. For each district, find a kind of restaurant that is relatively rare in that district
but popular (high score or frequent occurrence) in the other districts of the same group.
3. Summarize the results.
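
Step 2 above can be sketched as follows. This is only one possible reading of "rare here but popular in the same group", not the method the project ends up using: it scores each category by the gap between its average count in the peer districts and its count in the target district. The function name `suggest`, the district labels and all counts are hypothetical and invented for illustration.

```python
from collections import Counter

# Hypothetical inputs, invented purely for illustration.
# Cluster label per district (what step 1 would produce).
groups = {"A": 0, "B": 0, "C": 0, "D": 1}
# Restaurant-category counts per district (e.g. tallied from venue data).
counts = {
    "A": Counter({"Cantonese": 5, "Sushi": 4}),
    "B": Counter({"Cantonese": 6, "Sushi": 3, "Thai": 1}),
    "C": Counter({"Cantonese": 4}),
    "D": Counter({"Pizza": 2}),
}


def suggest(district, groups, counts):
    """Return the category with the largest positive gap between its mean
    count in peer districts (same group) and its count in `district`."""
    peers = [d for d, g in groups.items()
             if g == groups[district] and d != district]
    if not peers:
        return None  # no peers in the group, so nothing to compare against

    # Every category seen in at least one peer district.
    categories = set().union(*(counts[d] for d in peers))

    def gap(category):
        peer_mean = sum(counts[d][category] for d in peers) / len(peers)
        return peer_mean - counts[district][category]

    # Sorting first makes tie-breaking deterministic.
    best = max(sorted(categories), key=gap)
    return best if gap(best) > 0 else None


print(suggest("C", groups, counts))
```

A score based on venue ratings instead of counts would fit the same structure; only `gap` would change.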


****
## 2) Data

This section describes the data used in the project, which comes from four main sources.

### - Table1: Districts list and other basic information

This table contains the names of the 18 districts along with the population, area,
density and region of each district. It is scraped from the relevant
Wikipedia page: https://en.wikipedia.org/wiki/Districts_of_Hong_Kong#Population

In [46]:
# Grab data from wikipedia
url='https://en.wikipedia.org/wiki/Districts_of_Hong_Kong#Population'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
tb = soup.find('table', class_='wikitable sortable')
# Get columns information
columns = []
for attributes in tb.find_all('th'):
    columns.append(attributes.get_text().strip('\n'))
# Get rows information
data_matrix = []
for rows in tb.find_all('tr'):
    data_vector = []
    for cell in rows.find_all('td'):
        data_vector.append(cell.get_text().strip('\n'))
    data_matrix.append(data_vector)    
data_matrix.pop(0)
data_matrix = np.array(data_matrix)
# Turn it into dataframe
df_data1 = pd.DataFrame(data_matrix, columns = columns)
df_data1

| | District | Chinese | Population[when?] [6] | Area(km²) | Density(/km²) | Region |
|---|---|---|---|---|---|---|
| 0 | Central and Western | 中西區 | 244600 | 12.44 | 19983.92 | Hong Kong Island |
| 1 | Eastern | 東區 | 574500 | 18.56 | 31217.67 | Hong Kong Island |
| 2 | Southern | 南區 | 269200 | 38.85 | 6962.68 | Hong Kong Island |
| 3 | Wan Chai | 灣仔區 | 150900 | 9.83 | 15300.1 | Hong Kong Island |
| 4 | Sham Shui Po | 深水埗區 | 390600 | 9.35 | 41529.41 | Kowloon |
| 5 | Kowloon City | 九龍城區 | 405400 | 10.02 | 40194.7 | Kowloon |
| 6 | Kwun Tong | 觀塘區 | 641100 | 11.27 | 56779.05 | Kowloon |
| 7 | Wong Tai Sin | 黃大仙區 | 426200 | 9.3 | 45645.16 | Kowloon |
| 8 | Yau Tsim Mong | 油尖旺區 | 318100 | 6.99 | 44864.09 | Kowloon |
| 9 | Islands | 離島區 | 146900 | 175.12 | 825.14 | New Territories |
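
Because the scraped cells pass through a NumPy string array, every value in the table is text, and the population header still carries Wikipedia footnote markers. A minimal cleaning sketch is shown here, assuming the column names match the output above. The helper name `clean_districts` is hypothetical, and the three sample rows are copied from that output.

```python
import re

import pandas as pd

# A few rows copied from the output above; every cell is a string, as scraped.
raw = pd.DataFrame(
    [
        ["Central and Western", "中西區", "244600", "12.44", "19983.92", "Hong Kong Island"],
        ["Sham Shui Po", "深水埗區", "390600", "9.35", "41529.41", "Kowloon"],
        ["Islands", "離島區", "146900", "175.12", "825.14", "New Territories"],
    ],
    columns=["District", "Chinese", "Population[when?] [6]",
             "Area(km²)", "Density(/km²)", "Region"],
)


def clean_districts(df):
    """Return a copy with footnote markers removed from the headers and the
    numeric columns converted from text to numbers."""
    out = df.copy()
    # Drop bracketed markers such as "[when?]" and "[6]" from column names.
    out.columns = [re.sub(r"\s*\[[^\]]*\]", "", c).strip() for c in out.columns]
    for col in ["Population", "Area(km²)", "Density(/km²)"]:
        # Remove thousands separators, if any, before converting.
        out[col] = pd.to_numeric(out[col].str.replace(",", "", regex=False))
    return out


clean = clean_districts(raw)
print(clean.dtypes)
```

Numeric columns are needed later if values such as density are to be used as clustering features.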


### - Table2: The information about levels of income in each district

The second table presents the median monthly family income in each district. This
data is from a news article: http://www.orangenews.hk/finance/system/2017/02/28/010053713.shtml.
The original information is an image, so I transcribed it manually into a csv file,
IncomeHK.csv.


In [48]:
df_data2 = pd.read_csv('IncomeHK.csv')
df_data2


| | District | Income |
|---|---|---|
| 0 | Central and Western | 36000 |
| 1 | Eastern | 29830 |
| 2 | Southern | 30000 |
| 3 | Wan Chai | 37750 |
| 4 | Sham Shui Po | 20000 |
| 5 | Kowloon City | 25550 |
| 6 | Kwun Tong | 20160 |
| 7 | Wong Tai Sin | 22000 |
| 8 | Yau Tsim Mong | 23500 |
| 9 | Islands | 27700 |


### - Table3: The coordinates of districts

This table contains the coordinates of the 18 districts. Each pair of coordinates is taken
from the district's own Wikipedia page. I transcribed them manually into a csv file,
CoordinatesHK.csv.

In [49]:
df_data3 = pd.read_csv('CoordinatesHK.csv')
df_data3

| | District | Latitude | Longitude |
|---|---|---|---|
| 0 | Central and Western | 22.28666 | 114.15497 |
| 1 | Eastern | 22.28411 | 114.22414 |
| 2 | Southern | 22.24725 | 114.15884 |
| 3 | Wan Chai | 22.27968 | 114.17168 |
| 4 | Sham Shui Po | 22.33074 | 114.1622 |
| 5 | Kowloon City | 22.3282 | 114.19155 |
| 6 | Kwun Tong | 22.31326 | 114.22581 |
| 7 | Wong Tai Sin | 22.33353 | 114.19686 |
| 8 | Yau Tsim Mong | 22.32138 | 114.1726 |
| 9 | Islands | 22.26114 | 113.94608 |
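
Table1, Table2 and Table3 all share a District column, so they can be joined into one frame with one row per district. The sketch below assumes the district names are spelled identically in all three tables, which is worth checking because two of them were typed by hand. It uses three rows copied from the outputs above, and the frame names `table1`, `table2`, `table3` and `merged` are illustrative.

```python
import pandas as pd

# Three rows copied from the outputs above; the real tables have 18 rows.
districts = ["Central and Western", "Eastern", "Southern"]
table1 = pd.DataFrame({"District": districts,
                       "Density(/km²)": [19983.92, 31217.67, 6962.68]})
table2 = pd.DataFrame({"District": districts,
                       "Income": [36000, 29830, 30000]})
table3 = pd.DataFrame({"District": districts,
                       "Latitude": [22.28666, 22.28411, 22.24725],
                       "Longitude": [114.15497, 114.22414, 114.15884]})

# validate="one_to_one" makes pandas raise if a district appears twice.
merged = (
    table1
    .merge(table2, on="District", how="inner", validate="one_to_one")
    .merge(table3, on="District", how="inner", validate="one_to_one")
)

# An inner join silently drops names that do not match, so compare row counts.
assert len(merged) == len(table1), "a district name failed to match"
print(merged)
```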


### - Table4: The venues in each district

This data comes from the Foursquare API, queried with the coordinates from
Table3. It contains details of the venues in each district.


In [55]:
# @hidden_cell
import os

# Read the Foursquare credentials from the environment instead of hard-coding them.
# The environment variable names are placeholders; use whatever names you have set.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20180605'  # API version date
LIMIT = 50  # limit of number of venues returned by Foursquare API
RADIUS = 500  # search radius in meters around each district's coordinates

venues_dict = {}
for index, row in df_data3.iterrows():
    # Query parameters for the explore endpoint; requests builds the query string
    params = {
        'client_id': CLIENT_ID,
        'client_secret': CLIENT_SECRET,
        'v': VERSION,
        'll': '{},{}'.format(row['Latitude'], row['Longitude']),
        'radius': RADIUS,
        'limit': LIMIT,
    }
    results = requests.get('https://api.foursquare.com/v2/venues/explore', params=params).json()
    # Keep the raw JSON response for each district
    venues_dict[row['District']] = results
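
The raw responses stored in venues_dict are nested JSON, so they need flattening before any analysis. The sketch below assumes the usual layout of a v2 explore response (`response` → `groups` → `items` → `venue`). That layout is an assumption, not something shown in this notebook, so verify it against a real response first. The helper name `extract_venues` and the sample response are hypothetical.

```python
def extract_venues(district, results):
    """Flatten one explore response into
    (district, venue name, category, latitude, longitude) tuples."""
    rows = []
    # Assumed layout: results["response"]["groups"][i]["items"][j]["venue"]
    for group in results.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item.get("venue", {})
            # Use the first listed category; tolerate venues with none.
            categories = venue.get("categories") or [{}]
            location = venue.get("location", {})
            rows.append((
                district,
                venue.get("name"),
                categories[0].get("name"),
                location.get("lat"),
                location.get("lng"),
            ))
    return rows


# Hand-written fragment for illustration only, not real API output.
sample = {
    "response": {
        "groups": [{
            "items": [{
                "venue": {
                    "name": "Example Noodle Shop",
                    "categories": [{"name": "Noodle House"}],
                    "location": {"lat": 22.28, "lng": 114.15},
                }
            }]
        }]
    }
}

print(extract_venues("Central and Western", sample))
```

Applied to every entry of venues_dict, the resulting tuples could be collected into a single DataFrame of venues per district.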


****
## 3) Methodology

TODO...



