# Capstone Project - The Battle of Neighborhoods

Applied Data Science Capstone from IBM (on Coursera.org)

## 1. Introduction

### 1.1 Background

In the "Applied Data Science Capstone" course, we explored, segmented, and clustered the neighborhoods in the City of Toronto based on the venues in each postcode area.

This capstone project continues that analysis with additional features (elementary and secondary school ratings, 2016 census data from Statistics Canada, and housing data from HouseSigma) to better describe the characteristics of each neighborhood.

### 1.2 Problem

Many factors besides venues or points of interest can characterize a neighborhood.

Other factors that a family may use as a reference when choosing a neighborhood include, but are not limited to, house and rental prices, the availability of good elementary and secondary schools, and average family income.

This analysis tries to answer the question: Which neighborhoods share similar features, so that a family with young children can choose among them when purchasing a residential property?

### 1.3 Interest

The intended audience is families with school-age children who are looking for a residential property in the City of Toronto.

Note: _Residential property_ or _house_ here refers to a detached house, semi-detached house, townhouse, or apartment.

## 2. Data Acquisition and Cleaning

### FSA Postcodes of City of Toronto
[https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)

HTML scraping is used to get all Forward Sortation Area (FSA) postcodes, boroughs, and neighbourhoods in the City of Toronto.

These data cleaning steps will be applied:
- The dataframe will consist of three columns: Postcode, Borough, and Neighborhood
- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated by a comma.
- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.
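
The cleaning steps above can be sketched with pandas as follows. This is a minimal sketch, not the project's actual code: it assumes the scraped table is already in a DataFrame with the three columns named in the list, and that unassigned cells contain the literal string `Not assigned`. The function name `clean_fsa_table` is hypothetical.

```python
import pandas as pd


def clean_fsa_table(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning rules to a scraped Postcode/Borough/Neighborhood table."""
    # Keep only the three expected columns and drop rows without a borough.
    df = raw.loc[
        raw["Borough"] != "Not assigned",
        ["Postcode", "Borough", "Neighborhood"],
    ].copy()

    # A neighborhood that is "Not assigned" falls back to the borough name.
    df["Neighborhood"] = df["Neighborhood"].mask(
        df["Neighborhood"] == "Not assigned", df["Borough"]
    )

    # Combine neighborhoods sharing a postcode into one comma-separated row.
    return (
        df.groupby(["Postcode", "Borough"], sort=False)["Neighborhood"]
        .agg(", ".join)
        .reset_index()
    )
```

Grouping on both `Postcode` and `Borough` keeps the borough column in the result without a separate aggregation rule.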

Note: The FSA postcodes are the first three characters of Canadian postal codes.  For example, _M5H_ is the FSA postcode for the postal code '_M5H_ 2N2' (where Toronto City Hall is located).
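
As a small illustration of the note above, the FSA can be taken from a full postal code by slicing its first three characters. The helper name `fsa_of` is hypothetical, and the whitespace and case handling is an assumption about how the input might look.

```python
def fsa_of(postal_code: str) -> str:
    """Return the FSA (first three characters) of a Canadian postal code."""
    # Strip surrounding whitespace and upper-case before slicing (assumed input cleanup).
    return postal_code.strip().upper()[:3]


print(fsa_of("M5H 2N2"))  # M5H
```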

In addition, the latitude and longitude coordinates of each FSA postcode will be loaded from the [http://cocl.us/Geospatial_data/Geospatial_Coordinates.csv](http://cocl.us/Geospatial_data/Geospatial_Coordinates.csv) file.
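
One way to attach the coordinates to the postcode table is a left join on the postcode. The sketch below assumes the CSV has columns named `Postal Code`, `Latitude`, and `Longitude`; those names, and the helper `add_coordinates`, are assumptions rather than something stated in this document.

```python
import pandas as pd


def add_coordinates(fsa: pd.DataFrame, coords: pd.DataFrame) -> pd.DataFrame:
    """Left-join latitude/longitude onto the FSA table by postcode."""
    # Align the (assumed) CSV column name with the FSA table's key column.
    coords = coords.rename(columns={"Postal Code": "Postcode"})
    return fsa.merge(
        coords[["Postcode", "Latitude", "Longitude"]],
        on="Postcode",
        how="left",  # keep every FSA row, even if no coordinates are found
    )


# Possible usage (performs a network request, so it is left commented out):
# coords = pd.read_csv("http://cocl.us/Geospatial_data/Geospatial_Coordinates.csv")
# toronto = add_coordinates(clean_table, coords)
```

A left join makes missing coordinates visible as `NaN` instead of silently dropping the postcode.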

### Foursquare Location Data

The Foursquare API will be used to find the following in each postcode area:
- Venues: limited to a maximum of 100 venues
- Schools (education)
- Subway stations (transportation): the number of subway stations within 800 m (a 10-minute walk)
- Bus stops (transportation): the number of bus stops within 800 m (a 10-minute walk)
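
A request for the items listed above might be built as in the sketch below. It assumes the legacy Foursquare v2 `venues/explore` endpoint with `ll`, `radius`, `limit`, and `v` parameters; the version date and the function name `explore_url` are placeholders, and the filtering needed to isolate schools, subway stations, or bus stops is deliberately left out.

```python
from urllib.parse import urlencode

# Legacy v2 endpoint (assumption: the project uses this API version).
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def explore_url(lat, lng, client_id, client_secret,
                radius_m=800, limit=100, version="20180605"):
    """Build an explore request around one postcode's coordinates."""
    params = {
        "client_id": client_id,          # your own Foursquare credentials
        "client_secret": client_secret,
        "v": version,                    # placeholder API version date
        "ll": f"{lat},{lng}",            # centre of the postcode area
        "radius": radius_m,              # 800 m, roughly a 10-minute walk
        "limit": limit,                  # at most 100 venues per request
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"
```

The resulting URL can be passed to any HTTP client; keeping URL construction separate from the request makes the parameters easy to inspect.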

### Fraser Institute - School Ranking

HTML scraping is used to collect school rankings:
- Elementary: [http://ontario.compareschoolrankings.org/elementary/SchoolsByRankLocationName.aspx?schooltype=elementary](http://ontario.compareschoolrankings.org/elementary/SchoolsByRankLocationName.aspx?schooltype=elementary)
- Secondary: [http://ontario.compareschoolrankings.org/secondary/SchoolsByRankLocationName.aspx?schooltype=secondary](http://ontario.compareschoolrankings.org/secondary/SchoolsByRankLocationName.aspx?schooltype=secondary)

School rankings range from 0 to 10. They will be divided by 10 to normalize them to the range 0 to 1 for clustering.

We will ignore schools without school rankings from this data source.
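
The normalization and the exclusion of unranked schools could look like the sketch below. The column name `Rating` and the function name `normalize_ratings` are hypothetical, and treating any non-numeric value as "no ranking" is an assumption about the scraped data.

```python
import pandas as pd


def normalize_ratings(schools: pd.DataFrame, column: str = "Rating") -> pd.DataFrame:
    """Scale 0-10 rankings to 0-1 and drop schools without a ranking."""
    # Non-numeric entries (e.g. "n/a") become NaN so they can be dropped.
    ratings = pd.to_numeric(schools[column], errors="coerce")
    out = schools.assign(**{column: ratings / 10.0})
    return out.dropna(subset=[column]).reset_index(drop=True)
```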

### HouseSigma - Median Price Report

[https://housesigma.com/site/en/analyze/city-median-price-report](https://housesigma.com/site/en/analyze/city-median-price-report)

HTML scraping is used to collect median prices, in Canadian dollars, for detached houses, semi-detached houses, freehold townhouses, condo townhouses, and condo apartments.

Data in this report are organized by city and community, not by FSA postcode. The community names are close to the neighborhood names in the FSA postcode data source, but some cleaning is still needed. For example, the _Wexford-Maryvale_ community in Scarborough (borough) corresponds to two neighborhoods, _Wexford Heights_ and _Maryvale_, in the FSA postcode data above.
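
One possible way to handle such mismatches is a hand-maintained alias table that expands a community into its FSA neighborhoods. In the sketch below, only the _Wexford-Maryvale_ entry comes from this document; the column names `Community` and `Neighborhood` and the function name `expand_communities` are assumptions.

```python
import pandas as pd

# Hand-maintained aliases; extend as further mismatches are found.
COMMUNITY_TO_NEIGHBORHOODS = {
    "Wexford-Maryvale": ["Wexford Heights", "Maryvale"],
}


def expand_communities(prices: pd.DataFrame) -> pd.DataFrame:
    """Give each HouseSigma community one row per matching FSA neighborhood."""
    out = prices.copy()
    # Communities without an alias are assumed to match a neighborhood by name.
    out["Neighborhood"] = out["Community"].map(
        lambda name: COMMUNITY_TO_NEIGHBORHOODS.get(name, [name])
    )
    # explode() turns each list entry into its own row, repeating the prices.
    return out.explode("Neighborhood").reset_index(drop=True)
```

Each expanded row carries the community's median prices, so both neighborhoods inherit the same values.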

### Statistics Canada - 2016 Census

HTML scraping is used to collect demographic and economic data (such as average family income) for each FSA postcode area.

For example, the URL for the M5H postcode (passed as the `Code1` parameter):
[https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/page.cfm?Lang=E&Geo1=FSA&Code1=M5H&Geo2=PR&Code2=35&SearchText=M6H&SearchType=Begins&SearchPR=01&B1=All&TABID=2&type=0](https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/page.cfm?Lang=E&Geo1=FSA&Code1=M5H&Geo2=PR&Code2=35&SearchText=M6H&SearchType=Begins&SearchPR=01&B1=All&TABID=2&type=0)
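
The per-postcode URL can be generated from the example above by substituting the FSA into the query string. This is a sketch: the parameter names and values are copied from the example URL, while setting `SearchText` to the same FSA as `Code1` is an assumption, and the function name `census_profile_url` is hypothetical.

```python
from urllib.parse import urlencode

CENSUS_PROFILE_URL = (
    "https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/page.cfm"
)


def census_profile_url(fsa: str) -> str:
    """Build the 2016 Census Profile URL for one FSA postcode."""
    params = {
        "Lang": "E",
        "Geo1": "FSA",
        "Code1": fsa,          # the FSA postcode to look up
        "Geo2": "PR",
        "Code2": "35",
        "SearchText": fsa,     # assumption: mirrors Code1
        "SearchType": "Begins",
        "SearchPR": "01",
        "B1": "All",
        "TABID": "2",
        "type": "0",
    }
    return f"{CENSUS_PROFILE_URL}?{urlencode(params)}"


print(census_profile_url("M5H"))
```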