# Coursera Capstone - New Business Location Advisor

## Introduction and Business Problem

This project will provide advice to entrepreneurs who are considering opening a small business in the Greater Toronto Area (GTA). Specifically, it will attempt to identify neighbourhoods where the demand for the type of business they are considering may be the highest.

The approach will be to obtain data regarding the mix of businesses already established in each GTA neighbourhood. The entrepreneur will provide the type of business under consideration. Then each GTA neighbourhood will be compared against similar neighbourhoods in the city to see if there is currently an undersupply or oversupply of that business type in the neighbourhood.

It is important that the comparison be done against similar neighbourhoods, and not against city-wide averages, in order to take into account the diversity of Toronto neighbourhoods, which is reflected in the mix of businesses they contain.
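The comparison against similar neighbourhoods can be sketched as follows. This is only a minimal illustration under stated assumptions: the grouping of neighbourhoods into peer groups (`peer_group`), the function name `supply_gap`, the use of a business type's share of all businesses as the measure, and the sample counts are all hypothetical and are not taken from the project itself.

```python
from statistics import mean


def supply_gap(counts, peer_group, business_type):
    """Compare each neighbourhood's share of a business type with its peers.

    counts:     {fsa: {business_type: number_of_businesses}}
    peer_group: {fsa: label}, a hypothetical grouping of similar neighbourhoods
    Returns {fsa: share - mean share of its peer group}. The peer mean
    includes the neighbourhood itself. Negative values suggest undersupply.
    """
    # Share of the chosen business type within each neighbourhood.
    shares = {}
    for fsa, by_type in counts.items():
        total = sum(by_type.values())
        shares[fsa] = by_type.get(business_type, 0) / total if total else 0.0

    # Mean share within each group of similar neighbourhoods.
    grouped = {}
    for fsa, share in shares.items():
        grouped.setdefault(peer_group[fsa], []).append(share)
    group_mean = {label: mean(values) for label, values in grouped.items()}

    return {fsa: shares[fsa] - group_mean[peer_group[fsa]] for fsa in shares}


# Made-up counts for illustration only.
counts = {
    "M4K": {"cafe": 10, "bakery": 10},
    "M4L": {"cafe": 2, "bakery": 18},
    "M5V": {"cafe": 30, "bakery": 10},
    "M5J": {"cafe": 25, "bakery": 75},
}
peer_group = {"M4K": "A", "M4L": "A", "M5V": "B", "M5J": "B"}

gaps = supply_gap(counts, peer_group, "cafe")
most_undersupplied = min(gaps, key=gaps.get)
```

In this made-up data, M4L and M5J both fall below the average of their own peer group, even though M5J has far more cafes in absolute terms than M4L. A city-wide average would hide that difference.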

The results will be plotted on a street map of the city, allowing the entrepreneur to easily identify the neighbourhoods with the greatest undersupply and therefore those where demand for their new business is likely to be the greatest.

## Data

The data requirements of the project are:
- the geographical boundaries and identifying information for neighbourhoods in the GTA.
- data describing the businesses in each neighbourhood according to type, price range, and possibly other characteristics.

### Neighbourhoods

The idea of a 'neighbourhood' can be defined in various ways. For the purposes of this project, we will use postal codes. The first three characters of a Canadian postal code define a 'forward sortation area' (FSA), which is the final location from which mail is dispatched for delivery. Because Toronto has a large number of natural boundaries (rivers, ravines and parklands) as well as the usual man-made barriers like major thoroughfares, FSAs generally correspond well to what Torontonians think of as 'neighbourhoods'. There are approximately 100 FSAs in the GTA, which results in reasonably sized neighbourhoods that are neither too large nor too small.

Although the boundaries between forward sortation areas do often run along major thoroughfares, sampling Google Maps shows that, for practical reasons, all of the addresses on a given major street are usually placed in the same FSA. This is good for our purposes as well, as it would be rather artificial to consider the businesses on one side of a major street to be in a different 'neighbourhood' from those on the other side.

Using postal codes to define neighbourhoods provides the following additional benefits:
- the postal code for a business is readily available, and although occasionally it may be an off-site mailing address, it normally corresponds with the business's geographical location.
- the postal codes can be used directly to assign a business to a neighbourhood, so we do not need to rely on imprecise 'proximity to the neighbourhood center', or complicated geocode polygon testing to assign businesses to neighbourhoods based on their latitude and longitude.
- geographical data is available from Statistics Canada that precisely defines the boundaries of each FSA, allowing the neighbourhoods to be accurately depicted on a choropleth map.
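Assigning a business to a neighbourhood directly from its postal code could look something like the sketch below. The function names, the `name` and `postal_code` keys, and the sample records are assumptions made for illustration. The pattern only checks the letter-digit-letter shape of the code and accepts either a full postal code or a bare FSA.

```python
import re
from collections import defaultdict

# Canadian postal codes look like "A1A 1A1"; the first three characters
# are the FSA. The second half is optional here so a bare FSA also matches.
FSA_PATTERN = re.compile(r"^([A-Z]\d[A-Z])\s?(?:\d[A-Z]\d)?$")


def fsa_of(postal_code):
    """Return the FSA of a postal code, or None if it cannot be parsed."""
    if not postal_code:
        return None
    match = FSA_PATTERN.match(postal_code.strip().upper())
    return match.group(1) if match else None


def group_by_fsa(businesses):
    """Group business names by FSA, skipping records without a usable code."""
    groups = defaultdict(list)
    for business in businesses:
        fsa = fsa_of(business.get("postal_code"))
        if fsa is not None:
            groups[fsa].append(business["name"])
    return dict(groups)


# Made-up records for illustration only.
sample = [
    {"name": "Cafe A", "postal_code": "M4K 1N2"},
    {"name": "Bakery B", "postal_code": "m4k1p3"},
    {"name": "Shop C", "postal_code": "M5V 2T6"},
    {"name": "Stall D", "postal_code": ""},
]
by_fsa = group_by_fsa(sample)
```

No coordinates or polygon tests are involved: the grouping is a plain string operation, and records with a missing or malformed postal code are simply left out.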

There are a few FSAs which do not correspond with neighbourhoods: some are processing centers and others are 'Stations' used for business reply mail. We will exclude these from our neighbourhoods. Also, a few of Toronto's office towers, as well as the 'Toronto Underground', have their own FSAs. We will consider these to be a part of the geographical neighbourhood where they are located.
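One simple way to apply these two rules is a pair of lookup tables, as in the sketch below. The specific codes in `EXCLUDED_FSAS` and `MERGED_FSAS`, and the choice of which surrounding FSA absorbs each tower, are placeholders for illustration only. The real lists would have to be built by inspecting the scraped FSA list.

```python
# Placeholder codes for illustration; the actual lists must be checked
# against the FSA list used by the project.
EXCLUDED_FSAS = {"M7R", "M7Y"}  # assumed: processing centres / reply-mail stations

# Assumed mapping: special-purpose FSA -> surrounding geographical FSA.
MERGED_FSAS = {
    "M5K": "M5H",
    "M5X": "M5H",
}


def neighbourhood_for(fsa):
    """Return the FSA to use as the neighbourhood, or None if excluded."""
    if fsa in EXCLUDED_FSAS:
        return None
    return MERGED_FSAS.get(fsa, fsa)


resolved = [neighbourhood_for(code) for code in ["M4K", "M5K", "M7Y"]]
```

Ordinary FSAs pass through unchanged, merged ones are renamed to their host neighbourhood, and excluded ones return `None` so that callers can drop them.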

We will use three sources to define and identify the neighbourhoods:
- the Wikipedia page at https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M contains a list of forward sortation areas and the name of the borough and neighbourhood(s) they correspond to. The HTML can be scraped to get identifying information for each FSA that is more human-friendly than the 3-character FSA codes.
- we will use the ArcGIS site at https://www.arcgis.com/ to get the location of each FSA. This information will be used to place markers into the neighbourhoods.
- we will use the FSA geodata available at https://www150.statcan.gc.ca/n1/en/catalogue/92-179-X to enable us to plot the boundaries of each FSA on a choropleth map.

### Business Information

Although the course suggests obtaining business data from Foursquare, I found this to be unsatisfactory for a few reasons:
- the venue data includes a lot of locations that one would not really consider 'businesses', as well as 'events' that are not really even 'venues'.
- postal code information is available for only about 75% of the venues.
- since the data is customer-sourced, it is likely to be biased towards neighbourhoods and venues that are frequented by a specific 'highly connected' demographic.

For our use case involving small businesses, Yelp seems a better choice. It has virtually complete postal code information, and is specifically targeted at local businesses. The free account API limits are much lower than Foursquare's, but manageable. It also has 'price range' information, which I feel is critical. As in most other cities, Toronto neighbourhoods tend to have residents that share a common income level, and this is an important factor in locating a business. For example, we would not want to recommend locating a high-end boutique in a lower income neighbourhood, or a thrift shop in a high income neighbourhood.
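To show how the postal code and price range might be pulled out of a Yelp result, here is a minimal sketch. It assumes a business record shaped like the ones returned by the Yelp Fusion business search, with `name`, `price` (a string of one to four `$` signs), `categories` and `location.zip_code` fields. The function name and the sample record are hypothetical, and the field names should be verified against the current Yelp documentation before relying on them.

```python
def summarise_business(record):
    """Reduce a Yelp-style business record to the fields used here.

    Assumes keys 'name', 'price', 'categories' and 'location' -> 'zip_code';
    any of them may be missing, in which case None or [] is returned.
    """
    location = record.get("location") or {}
    zip_code = (location.get("zip_code") or "").replace(" ", "").upper()
    price = record.get("price")
    return {
        "name": record.get("name"),
        # The FSA is the first three characters of the postal code.
        "fsa": zip_code[:3] if len(zip_code) >= 3 else None,
        # "$" -> 1, "$$" -> 2, and so on; None when no price is listed.
        "price_level": len(price) if price else None,
        "categories": [c["alias"] for c in record.get("categories", [])],
    }


# Made-up record for illustration only.
sample_record = {
    "name": "Example Boutique",
    "price": "$$$",
    "categories": [{"alias": "fashion", "title": "Fashion"}],
    "location": {"zip_code": "M5R 1A1"},
}
summary = summarise_business(sample_record)
```

Converting the price string to a number makes it easy to compare a business's price level with the typical level of its neighbourhood.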

