# CAPSTONE PROJECT: BATTLE OF THE NEIGHBORHOODS
Singapore Visitors and Expatriates Venue Recommendation


I. PURPOSE
This document provides the details of my final peer-reviewed assignment for the IBM Data Science Professional Certificate program – Coursera Capstone.

II. INTRODUCTION
Singapore is a small country and one of the most visited in Asia. Many websites let travelers look up recommendations for places to stay or visit. However, most of these websites base their recommendations on the usual tourist attractions or on key residential areas that are mostly expensive or already well known to travelers, using keywords such as "Hotel" or "Backpackers". The intention of this project is to provide a data-driven recommendation that supplements such recommendations with statistical data. It uses data retrieved from Singapore open data sources and FourSquare API venue recommendations.
The sample recommender in this notebook addresses the following use case:
A person is planning to visit Singapore as a tourist or an expat and is looking for reasonably priced accommodation.
The user wants a recommendation of towns where they can stay or rent an HDB apartment in close proximity to places of interest or to venues in a chosen search category.
The recommendation should not only present the most viable option, but also present a comparison table of all possible town venues.
For this demonstration, this notebook will make use of the following data:
Singapore Median Rental Prices by town.
Popular Food venues in the vicinity. (Sample category selection)
Note: While this demo uses the Food venue category, the same implementation can be applied to other categories, such as:
Outdoors and Recreation
Nightlife
Nearby Schools, etc.
I will limit the scope of this search because the FourSquare API allows only 50 venue queries per day with free user access.

III. DATA ACQUISITION
This demonstration will make use of the following data sources:
Singapore Towns and median residential rental prices.
Data will be retrieved from the Singapore open dataset "median rent by town and flat type" on the https://data.gov.sg website.
The original data source contains median rental prices of Singapore HDB units from 2005 up to the 2nd quarter of 2018. I will retrieve the most recent recorded rental prices from this data source (Q2 2018), as these are the most relevant prices available at this time. For this demonstration, I will simplify the analysis by using the average rental price across all available flat types.
Singapore Towns location data retrieved using Google Maps API.
Coordinates of town venues will be retrieved using the Google API. I also use MRT station coordinates as the more meaningful center point for each town included in the venue recommendations.
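The coordinate lookup can be sketched as follows. This is a minimal illustration, not the notebook's actual code: it assumes a geocoder callable that takes a query string and returns an object with `latitude` and `longitude` attributes, or `None` when nothing is found. The names `geocode_towns` and `fake_geocode`, the query suffix, and the placeholder coordinates are all hypothetical.

```python
from types import SimpleNamespace

def geocode_towns(towns, geocode, suffix=", Singapore"):
    """Return {town: (lat, lng)}, or None for towns that could not be resolved."""
    coords = {}
    for town in towns:
        location = geocode(town + suffix)
        if location is None:
            coords[town] = None
        else:
            coords[town] = (location.latitude, location.longitude)
    return coords

# Offline stand-in for a real geocoder; the coordinates are placeholders.
def fake_geocode(query):
    if query.startswith("ANG MO KIO"):
        return SimpleNamespace(latitude=1.37, longitude=103.85)
    return None

town_coords = geocode_towns(["ANG MO KIO", "UNKNOWN TOWN"], fake_geocode)
print(town_coords)
```

With geopy installed, the `geocode` method of a `Nominatim(user_agent=...)` instance has this shape and could be passed in place of the stub; a thin wrapper around the Google API should work the same way.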
Singapore Top Venue Recommendations from FourSquare API
(FourSquare website: www.foursquare.com)
I will be using the FourSquare API to explore neighborhoods in selected towns in Singapore. The FourSquare explore function will be used to get the most common venue categories in each neighborhood, and this feature will then be used to group the neighborhoods into clusters. The following information is retrieved in the first query:
Venue ID
Venue Name
Coordinates : Latitude and Longitude
Category Name
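The four fields listed above can be pulled out of an explore response roughly as shown here. This sketch assumes the legacy v2 `venues/explore` endpoint and its usual response layout (`response.groups[].items[].venue`); the function names, default parameter values, and the sample payload are illustrative placeholders rather than real query results.

```python
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"

def explore_params(client_id, client_secret, lat, lng,
                   version="20180605", radius=1000, limit=50):
    """Build the query parameters for one explore call (defaults are assumptions)."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,                      # API version date, YYYYMMDD
        "ll": "{},{}".format(lat, lng),    # search center
        "radius": radius,                  # meters
        "limit": limit,
    }

def venues_from_explore(payload):
    """Flatten an explore response into one dict per venue."""
    rows = []
    for group in payload.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item.get("venue", {})
            location = venue.get("location", {})
            categories = venue.get("categories") or [{}]
            rows.append({
                "venue_id": venue.get("id"),
                "venue_name": venue.get("name"),
                "latitude": location.get("lat"),
                "longitude": location.get("lng"),
                "category": categories[0].get("name"),
            })
    return rows

# Made-up payload with the assumed layout, so the sketch runs offline.
sample_payload = {"response": {"groups": [{"items": [{"venue": {
    "id": "placeholder-id",
    "name": "Placeholder Cafe",
    "location": {"lat": 1.37, "lng": 103.85},
    "categories": [{"name": "Café"}],
}}]}]}}

sample_rows = venues_from_explore(sample_payload)
print(sample_rows)
```

In a live run, the payload would come from something like `requests.get(EXPLORE_URL, params=explore_params(...)).json()`, and the resulting rows can be passed to `pd.DataFrame` for further analysis.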
A second venue query will be performed to retrieve the rating of each venue. Note that rating information is a paid service from FourSquare, and free access is limited to only 50 queries per day. Because of this constraint, the category analysis in this demo is limited to a single category. I will try to retrieve as many ratings as possible for the retrieved venue IDs.
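One way to stay within the 50-queries-per-day constraint is to cap the number of rating lookups and keep track of the venues left over for another day. The sketch below assumes a caller-supplied `fetch_one(venue_id)` function that performs the actual rating request; `fetch_ratings`, `fetch_one`, and the stub used for the demonstration are hypothetical names, not part of the FourSquare API.

```python
def fetch_ratings(venue_ids, fetch_one, daily_limit=50):
    """Query ratings for at most `daily_limit` venues.

    Returns (ratings, skipped): a dict of venue_id -> rating for the venues
    that were queried, and the list of venue IDs that did not fit the quota.
    """
    venue_ids = list(venue_ids)
    to_query = venue_ids[:daily_limit]
    skipped = venue_ids[daily_limit:]
    ratings = {venue_id: fetch_one(venue_id) for venue_id in to_query}
    return ratings, skipped

# Stub that stands in for the real rating request; returns a fixed placeholder.
def fake_fetch_one(venue_id):
    return 7.5

demo_ids = ["venue-{}".format(i) for i in range(60)]
demo_ratings, demo_skipped = fetch_ratings(demo_ids, fake_fetch_one)
print(len(demo_ratings), "rated;", len(demo_skipped), "left for another day")
```

Returning the skipped IDs makes it straightforward to resume the next day with the remaining venues.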

IV. METHODOLOGY
Singapore Towns List with median residential rental prices.
The source data contains median rental prices of Singapore HDB units from 2005 up to the 2nd quarter of 2018. I will retrieve the most recent recorded rental prices from this data source (Q2 2018), as these are the most relevant prices available at this time. For this demonstration, I will simplify the analysis by using the average rental price across all available flat types.
Data cleanup and re-grouping. The retrieved table contains some unwanted entries and needs some cleanup.
The following tasks will be performed:
Drop/ignore cells with missing data.
Use most current data record.
Fix data types.
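The three cleanup tasks above can be sketched on a small hand-made table. This is an illustration using `pd.to_numeric` rather than the exact steps of the notebook; the 2005 rows mirror the dataset's format, while the 2018 rents are placeholder values, not figures from the dataset.

```python
import pandas as pd

# Toy input in the same shape as the rent dataset (2018 values are placeholders).
raw = pd.DataFrame({
    "quarter": ["2005-Q2", "2005-Q2", "2018-Q2", "2018-Q2", "2018-Q2"],
    "Town": ["ANG MO KIO"] * 5,
    "flat_type": ["1-RM", "3-RM", "3-RM", "4-RM", "5-RM"],
    "median_rent": ["na", "800", "1800", "2100", "-"],
})

clean = raw.copy()

# 1. Drop rows with missing data: 'na' and '-' become NaN, then are removed.
clean["median_rent"] = pd.to_numeric(clean["median_rent"], errors="coerce")
clean = clean.dropna(subset=["median_rent"])

# 2. Keep only the most current quarter ("YYYY-Qn" strings sort chronologically).
latest_quarter = clean["quarter"].max()
clean = clean[clean["quarter"] == latest_quarter].drop(columns="quarter")

# 3. Fix data types: make the rent column explicitly float64.
clean["median_rent"] = clean["median_rent"].astype("float64")

print(latest_quarter)
print(clean)
```

Converting with `errors="coerce"` handles any non-numeric marker in one step, so the list of placeholder strings does not need to be known in advance.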


Importing Python Libraries
This section imports the Python libraries required for processing data.
While this first part of the notebook is for data acquisition, some of the libraries will also be used for data visualization.

In [1]:
!conda install -c conda-forge folium=0.5.0 --yes # comment/uncomment if not yet installed.
!conda install -c conda-forge geopy --yes        # comment/uncomment if not yet installed

import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis

# Show all columns and rows when displaying dataframes.
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# import k-means for the clustering stage
from sklearn.cluster import KMeans
import folium # map rendering library

import requests # library to handle requests
import lxml.html as lh
import bs4 as bs
import urllib.request

print('Libraries imported.')


In [2]:
from IPython.display import HTML
import base64

# Extra Helper scripts to generate download links for saved dataframes in csv format.
def create_download_link( df, title = "Download CSV file", filename = "data.csv"):  
    csv = df.to_csv()
    b64 = base64.b64encode(csv.encode())
    payload = b64.decode()
    html = '<a download="{filename}" href="data:text/csv;base64,{payload}" target="_blank">{title}</a>'
    html = html.format(payload=payload,title=title,filename=filename)
    return HTML(html)

In [3]:
import zipfile
import os
!wget -q -O 'median-rent-by-town-and-flat-type.zip' "https://data.gov.sg/dataset/b35046dc-7428-4cff-968d-ef4c3e9e6c99/download"
zf = zipfile.ZipFile('./median-rent-by-town-and-flat-type.zip')
sgp_median_rent_by_town_data = pd.read_csv(zf.open("median-rent-by-town-and-flat-type.csv"))
sgp_median_rent_by_town_data.rename(columns = {'town':'Town'}, inplace = True)
sgp_median_rent_by_town_data.head()

Unnamed: 0,quarter,Town,flat_type,median_rent
0,2005-Q2,ANG MO KIO,1-RM,na
1,2005-Q2,ANG MO KIO,2-RM,na
2,2005-Q2,ANG MO KIO,3-RM,800
3,2005-Q2,ANG MO KIO,4-RM,950
4,2005-Q2,ANG MO KIO,5-RM,-


In [4]:
# Drop rows whose rental price is missing (recorded as 'na' or '-').
sgp_median_rent_by_town_data_filter=sgp_median_rent_by_town_data[~sgp_median_rent_by_town_data['median_rent'].isin(['-','na'])]

# Take the most recent report which is "2018-Q2"
sgp_median_rent_by_town_data_filter=sgp_median_rent_by_town_data_filter[sgp_median_rent_by_town_data_filter['quarter'] == "2018-Q2"]

# Now that all remaining rows are from "2018-Q2", we don't need this column anymore.
sgp_median_rent_by_town_data_filter=sgp_median_rent_by_town_data_filter.drop(['quarter'], axis=1)

# Ensure that median_rent column is float64.
sgp_median_rent_by_town_data_filter['median_rent']=sgp_median_rent_by_town_data_filter['median_rent'].astype(np.float64)

In [5]:
singapore_average_rental_prices_by_town = sgp_median_rent_by_town_data_filter.groupby(['Town'])['median_rent'].mean().reset_index()
singapore_average_rental_prices_by_town

Unnamed: 0,Town,median_rent
0,ANG MO KIO,2033.333333
1,BEDOK,2087.5
2,BISHAN,2233.333333
3,BUKIT BATOK,1962.5
4,BUKIT MERAH,2162.5
5,BUKIT PANJANG,1737.5
6,CENTRAL,2450.0
7,CHOA CHU KANG,1933.333333
8,CLEMENTI,2263.333333
9,GEYLANG,2166.666667
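As a possible next step toward the comparison table, the averages can be ranked from cheapest to most expensive. The sketch below re-enters only the ten rows shown in the output above (rounded as displayed), so it is not the full list of towns; the variable names are illustrative.

```python
import pandas as pd

# The ten rows displayed above, copied by hand.
avg_rent = pd.DataFrame({
    "Town": ["ANG MO KIO", "BEDOK", "BISHAN", "BUKIT BATOK", "BUKIT MERAH",
             "BUKIT PANJANG", "CENTRAL", "CHOA CHU KANG", "CLEMENTI", "GEYLANG"],
    "median_rent": [2033.333333, 2087.5, 2233.333333, 1962.5, 2162.5,
                    1737.5, 2450.0, 1933.333333, 2263.333333, 2166.666667],
})

# Sort ascending so the most affordable town comes first.
ranked = avg_rent.sort_values("median_rent").reset_index(drop=True)

cheapest_town = ranked.loc[0, "Town"]
priciest_town = ranked.loc[len(ranked) - 1, "Town"]
print(ranked)
print("Cheapest:", cheapest_town, "| Most expensive:", priciest_town)
```

Among these ten towns, BUKIT PANJANG has the lowest average rent and CENTRAL the highest.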
