# The Battle of the Neighborhoods

### Applied Data Science Capstone Project - Mike Mander

## Table of contents
* [Introduction](#introduction)
* [Data](#data)

## Introduction <a name="introduction"></a>

Moving to a new place for a job is hard, and relocating with a young family adds concerns and considerations that have to be factored in. Choosing where to live is one of the most important decisions on arrival, yet you usually know little to nothing about the city, its services, transportation, demographics and so on. A bad choice at this stage can result in long commutes, resentment about the job and an unhappy social and family life. I will use publicly available data and data science skills to try to answer the question: where is the best location for me and my family to live?

My project assumes a hypothetical scenario in which a family with young children is moving to Singapore so that one parent can start a new job at the IBM office. They have been given a housing allowance, and a salary with some leeway to spend additional money on the right place if required.

This is a typical scenario for many expats, so I think that international companies' HR departments and new expat employees themselves will be interested in the analytical process and the results. The scenario I am going to work through is one of a few potential options, so during the data analysis I will point out where different choices could be made to suit expats in slightly different situations.


## Data <a name="data"></a>

Singapore has a wealth of publicly available government data on its website, https://data.gov.sg. I have mined this to get the GeoJSON data for its neighbourhoods, or planning areas as they are called. This splits Singapore into 55 areas, and I will use a library called <a href="https://pypi.org/project/Shapely/">Shapely</a> to find the centroid coordinates of each area. From those centroids I will use the <a href="https://foursquare.com">Foursquare</a> API to search each area for venues related to children, kids and babies. In the <a href="https://developer.foursquare.com/docs/resources/categories">Developer Categories list</a> I have found many venue types that would be of interest to parents with young children, including playgrounds, parks and baby shops. The full list of category IDs will be described in the methodology later.
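To make the centroid and venue search steps concrete, here is a minimal sketch rather than the project's actual code. In the project Shapely would compute the centroid; the sketch applies the standard area-weighted formula to the outer ring of a simple polygon in plain Python so that it runs without dependencies, treating longitude and latitude as planar coordinates. The GeoJSON snippet, the `name` property, the coordinates, the credentials and the category ID are all placeholders, and the endpoint and parameter names follow my understanding of Foursquare's v2 `venues/search` API, which should be checked against the current documentation.

```python
import json
from urllib.parse import urlencode

# Hypothetical one-feature GeoJSON; the real file comes from data.gov.sg
# and its property names may differ.
GEOJSON_TEXT = """
{"type": "FeatureCollection", "features": [
  {"type": "Feature",
   "properties": {"name": "EXAMPLE AREA"},
   "geometry": {"type": "Polygon", "coordinates": [[
     [103.80, 1.30], [103.82, 1.30], [103.82, 1.32],
     [103.80, 1.32], [103.80, 1.30]]]}}
]}
"""


def ring_centroid(ring):
    """Area-weighted centroid of a closed ring of (lon, lat) pairs."""
    area2 = 0.0  # twice the signed area of the ring
    cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3.0 * area2), cy / (3.0 * area2)


def venue_search_url(lat, lon, category_ids, radius_m=1000):
    """Build (but do not send) a venue search request for one centroid."""
    params = {
        "ll": f"{lat:.6f},{lon:.6f}",
        "radius": radius_m,
        "categoryId": ",".join(category_ids),
        "limit": 50,
        "client_id": "YOUR_CLIENT_ID",          # placeholder credential
        "client_secret": "YOUR_CLIENT_SECRET",  # placeholder credential
        "v": "20190101",                        # API version date, assumed
    }
    return "https://api.foursquare.com/v2/venues/search?" + urlencode(params)


feature = json.loads(GEOJSON_TEXT)["features"][0]
outer_ring = feature["geometry"]["coordinates"][0]  # ignores holes
lon, lat = ring_centroid(outer_ring)
url = venue_search_url(lat, lon, ["CATEGORY_ID_PLACEHOLDER"])
print(feature["properties"]["name"], round(lat, 4), round(lon, 4))
```

Real planning areas may be stored as `MultiPolygon` geometries or contain holes, which this sketch does not handle; that is one reason to rely on Shapely in the actual analysis.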

The government website also has a few other datasets that I will be using for the analysis. One table lists all the schools in the Singapore area and the age groups that they cater for, and a separate table holds the locations of all the schools. I will have to clean up the data in each table and then join the tables together to get the results in the format that I want. I have decided to use this data because I want the best possible coverage of pre-schools and nurseries, and I am not sure whether Foursquare will have all the schools listed.

Shown below are the column names and data types of the data set for the schools listing:

    import pandas as pd

    df_school_locs = pd.read_csv('Data/listing-of-centre.csv')
    df_school_locs.dtypes

    tp_code                     object
    centre_code                 object
    centre_name                 object
    organisation_code           object
    organisation_description    object
    service_model               object
    centre_contact_no           object
    centre_email_address        object
    centre_address              object
    postal_code                  int64
    centre_website              object
    infant_vacancy              object
    pg_vacancy                  object
    n1_vacancy                  object
    n2_vacancy                  object
    k1_vacancy                  object
    k2_vacancy                  object
    food_offered                object
    second_languages_offered    object
    spark_certified             object
    weekday_full_day            object
    saturday                    object
    scheme_type                 object
    extended_operating_hours    object
    provision_of_transport      object
    government_subsidy          object
    gst_regisration             object
    last_updated                object
    remarks

I will geocode the centre address from this data set to obtain location coordinates, then join the result with the school service listing on the centre code to filter the offerings to pre-school and nursery only.

The first 5 rows of the schools service listing are shown below:

    df_school_serv = pd.read_csv('Data/listing-of-services.csv')
    df_school_serv.head()

| Unnamed: 0 | centre_code | centre_name | class_of_licence | type_of_service | levels_offered | fees | type_of_citizenship | last_updated | remarks |
|---|---|---|---|---|---|---|---|---|---|
| 0 | EB0001 | E-BRIDGE PRE-SCHOOL PTE. LTD. | Class B (Child Care) | Full Day | Kindergarten 1 (5 yrs old) | 1080.0 | Others | 2019-01-04 | na |
| 1 | EB0001 | E-BRIDGE PRE-SCHOOL PTE. LTD. | Class B (Child Care) | Full Day | Kindergarten 1 (5 yrs old) | 720.0 | SC | 2019-01-04 | na |
| 2 | EB0001 | E-BRIDGE PRE-SCHOOL PTE. LTD. | Class B (Child Care) | Full Day | Kindergarten 1 (5 yrs old) | 900.0 | SPR | 2019-01-04 | na |
| 3 | EB0001 | E-BRIDGE PRE-SCHOOL PTE. LTD. | Class B (Child Care) | Full Day | Kindergarten 2 (6 yrs old) | 1080.0 | Others | 2019-01-04 | na |
| 4 | EB0001 | E-BRIDGE PRE-SCHOOL PTE. LTD. | Class B (Child Care) | Full Day | Kindergarten 2 (6 yrs old) | 720.0 | SC | 2019-01-04 | na |
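
The join between the two listings could look roughly like the sketch below. It uses tiny made-up frames in place of the real CSV files: only the column names `centre_code`, `centre_name`, `postal_code` and `levels_offered` come from the data shown above, while the centre codes, names and the "Nursery" level labels are assumptions for illustration. Matching on the word "Nursery" is just one possible filter; the actual level labels need to be checked in the data first.

```python
import pandas as pd

# Made-up stand-in for listing-of-centre.csv (one row per centre).
df_locs = pd.DataFrame({
    "centre_code": ["AA0001", "AA0002", "AA0003"],
    "centre_name": ["Centre A", "Centre B", "Centre C"],
    "postal_code": [100001, 100002, 100003],
})

# Made-up stand-in for listing-of-services.csv (many rows per centre,
# one per level and citizenship type).
df_serv = pd.DataFrame({
    "centre_code": ["AA0001", "AA0001", "AA0002", "AA0003", "AA0003"],
    "levels_offered": [
        "Nursery 1 (3 yrs old)",       # label assumed, not seen in the sample
        "Kindergarten 1 (5 yrs old)",
        "Kindergarten 2 (6 yrs old)",
        "Nursery 2 (4 yrs old)",       # label assumed, not seen in the sample
        "Nursery 2 (4 yrs old)",
    ],
})

# Keep only service rows whose level mentions "Nursery".
is_nursery = df_serv["levels_offered"].str.contains("Nursery", case=False, na=False)

# Reduce to one row per centre so the join does not duplicate locations.
nursery_codes = df_serv.loc[is_nursery, ["centre_code"]].drop_duplicates()

# Inner join: only centres that offer at least one nursery level remain.
df_nursery = df_locs.merge(nursery_codes, on="centre_code", how="inner")
print(df_nursery)
```

Dropping duplicate centre codes before the merge matters because the service listing repeats each centre once per level and citizenship type, as the five `EB0001` rows above show.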


Another data set I will seek to feature is the locations of the MRT stations around Singapore and the fare costs. Public transport in Singapore is very efficient and, coupled with the very high costs of owning a vehicle, this makes living close to an MRT station beneficial. I will have to decide how to weigh proximity to an MRT station against better locations for children that might be further away. Depending on the location of the IBM office in Singapore and the best location for the family to live, commute times and fare costs might have to come into consideration. The distances and fare costs are available as another public data set.
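
One simple way to measure proximity to an MRT station is the great-circle distance from an area's centroid to its nearest station. The sketch below uses the haversine formula for this; it is an illustration rather than the method the project is committed to, and the station names and all coordinates are made up.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_station(lat, lon, stations):
    """Return (name, distance_km) of the closest station to a point."""
    distances = (
        (name, haversine_km(lat, lon, s_lat, s_lon))
        for name, (s_lat, s_lon) in stations.items()
    )
    return min(distances, key=lambda pair: pair[1])


# Hypothetical stations and area centroid, for illustration only.
stations = {
    "Station A": (1.30, 103.80),
    "Station B": (1.35, 103.85),
}
name, dist_km = nearest_station(1.31, 103.81, stations)
print(name, round(dist_km, 2))
```

Straight-line distance ignores the walking route, so it is only a first approximation of how convenient a station really is.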

The final data set that I will be using gives an indicative look at the cost of renting. It contains the median rent by area and flat type per quarter from Q2-2005 to Q4-2018. This data can be interrogated and cleaned to show the prospective expat the expected budget for the chosen areas, as that could influence the decision on where to live.
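
As a sketch of how the rent data might be cleaned and summarised, the example below keeps the most recent quarter and lays out median rent by area and flat type. The column names (`quarter`, `town`, `flat_type`, `median_rent`), the year-first quarter labels, the use of "na" for missing values and every figure in the frame are assumptions made for illustration, not values taken from the real file.

```python
import pandas as pd

# Made-up stand-in for the median rent data set.
df_rent = pd.DataFrame({
    "quarter": ["2018-Q3", "2018-Q4", "2018-Q4", "2018-Q4"],
    "town": ["Area X", "Area X", "Area X", "Area Y"],
    "flat_type": ["3-RM", "3-RM", "4-RM", "3-RM"],
    "median_rent": ["1700", "1750", "2100", "na"],
})

# Non-numeric placeholders such as "na" become NaN.
df_rent["median_rent"] = pd.to_numeric(df_rent["median_rent"], errors="coerce")

# Year-first labels sort correctly as strings; other formats would need
# to be parsed before taking the maximum.
latest = df_rent["quarter"].max()

# One row per area, one column per flat type, for the latest quarter only.
rent_latest = (
    df_rent[df_rent["quarter"] == latest]
    .dropna(subset=["median_rent"])
    .pivot(index="town", columns="flat_type", values="median_rent")
)
print(rent_latest)
```

Areas with no recorded median for the latest quarter drop out of the table, so a real analysis might fall back to an earlier quarter for them.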