# Week 5: Applied Data Science Capstone Report

Now that you have the skills and tools to use location data to explore a geographical location, you will have two weeks to be as creative as you want. Come up with an idea that leverages the Foursquare location data to explore or compare neighborhoods or cities of your choice, or with a problem that the Foursquare location data can help solve. If you cannot think of an idea or a problem, here are some ideas to get you started:

In Module 3, we explored New York City and the city of Toronto and segmented and clustered their neighborhoods. Both cities are very diverse and are the financial capitals of their respective countries. One interesting idea would be to compare the neighborhoods of the two cities and determine how similar or dissimilar they are. Is New York City more like Toronto or Paris or some other multicultural city? I will leave it to you to refine this idea.

In a city of your choice, if someone is looking to open a restaurant, where would you recommend that they open it? Similarly, if a contractor is trying to start their own business, where would you recommend that they set up their office?

These are just a couple of the many ideas and problems that can be solved using location data in addition to other datasets. No matter what you decide to do, make sure to provide sufficient justification of why you think what you want to do or solve is important and why a client or a group of people would be interested in your project.

Review criteria

This capstone project will be graded by your peers. This capstone project is worth 70% of your total grade. The project will be completed over the course of 2 weeks. Week 1 submissions will be worth 30% whereas week 2 submissions will be worth 40% of your total grade.

For this week, you will be required to submit the following:

* A description of the problem and a discussion of the background. (15 marks)
* A description of the data and how it will be used to solve the problem. (15 marks)

For the second week, the final deliverables of the project will be:

* A link to your Notebook on your GitHub repository, showing your code. (15 marks)
* A full report consisting of all of the following components (15 marks):
    * Introduction where you discuss the business problem and who would be interested in this project.
    * Data where you describe the data that will be used to solve the problem and the source of the data.
    * Methodology section which represents the main component of the report where you discuss and describe any exploratory data analysis that you did, any inferential statistical testing that you performed, if any, and which machine learning techniques were used and why.
    * Results section where you discuss the results.
    * Discussion section where you discuss any observations you noted and any recommendations you can make based on the results.
    * Conclusion section where you conclude the report.
* Your choice of a presentation or blogpost. (10 marks)

---------------------------------------------------------

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 4>

1. <a href="#item1">Introduction</a>

2. <a href="#item2">Data</a>

3. <a href="#item3">Methodology</a>

4. <a href="#item4">Results</a>

5. <a href="#item5">Discussion</a>

6. <a href="#item6">Conclusion</a>

</font>
</div>

<a id='item1'></a>
## Introduction
_Discussion of the business problem and who would be interested in this project._

For my project, I have decided to determine which Toronto neighborhood would be most suitable for opening a new restaurant franchise, based on the current density of similar venues. Neighborhoods with lower restaurant densities would present less competition for the new franchise. This would be of interest to those in charge of a company seeking to expand its operations into the Toronto area.

---------------------------------------------------------

<a id='item2'></a>
## Data
_Description of the data that will be used to solve the problem and the source of the data._

To assess which Toronto neighborhood would most benefit from the addition of a new restaurant, it is necessary to determine which neighborhood has the lowest density of existing restaurants. This requires two quantities:

* the total number of restaurants in the neighborhood

* the total area of the neighborhood

By finding the ratio of these two numbers, it is possible to determine which neighborhood has the lowest restaurant density.  Therefore, it is possible to determine the neighborhood in which the company will have the least competition for consumers.

While the area of each neighborhood is a matter of simple geometry, the number of venues within its borders is a little more difficult to determine.  By leveraging the Foursquare API, it is possible to determine the number of venues with a type similar to "restaurant" for each neighborhood and thereby determine the baseline competition the new franchise will encounter in each neighborhood.

---------------------------------------------------------

<a id='item3'></a>
## Methodology
_The main component of the report, with discussion and description of any exploratory data analysis and inferential statistical testing that was performed and what machine learning techniques were used and why._

To answer this question, I began by associating the neighborhoods in each Toronto postal code with the latitude and longitude of the center of that postal code's designated area. I first scraped the list of Toronto postal codes on Wikipedia (https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) using BeautifulSoup. This involved identifying the HTML tag associated with the table, namely \<tbody\>, treating the elements of its `.contents` attribute as the table's rows, and appending each row to an array of strings. To convert this array into a DataFrame, I split the string for each row into its three component cells, giving a triple consisting of a postcode, borough, and neighborhood. To facilitate integration with the Foursquare API, I grouped together the neighborhoods that shared a postcode before associating each group with the latitude and longitude of its center. These coordinates were obtained by reusing the Geospatial_Coordinates.csv file from the week 3 assignment.
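The grouping step described above can be sketched roughly as follows. This is a minimal illustration in plain Python rather than the pandas DataFrame used in the notebook; the function name `group_by_postcode`, the sample rows, and the rounded coordinates are hypothetical stand-ins for the scraped table and the Geospatial_Coordinates.csv file.

```python
from collections import defaultdict

# Hypothetical sample rows in the (postcode, borough, neighborhood) form
# described above; the real rows come from the scraped Wikipedia table.
rows = [
    ("M1B", "Scarborough", "Rouge"),
    ("M1B", "Scarborough", "Malvern"),
    ("M9N", "York", "Weston"),
]

# Hypothetical stand-in for Geospatial_Coordinates.csv:
# postcode -> (latitude, longitude). Values are rounded and illustrative only.
coordinates = {
    "M1B": (43.81, -79.19),
    "M9N": (43.71, -79.52),
}


def group_by_postcode(rows, coordinates):
    """Merge neighborhoods sharing a postcode and attach the postcode's center."""
    grouped = defaultdict(lambda: {"borough": None, "neighborhoods": []})
    for postcode, borough, neighborhood in rows:
        entry = grouped[postcode]
        entry["borough"] = borough
        entry["neighborhoods"].append(neighborhood)

    result = []
    for postcode in sorted(grouped):
        entry = grouped[postcode]
        latitude, longitude = coordinates[postcode]
        result.append({
            "postcode": postcode,
            "borough": entry["borough"],
            "neighborhood": ", ".join(entry["neighborhoods"]),
            "latitude": latitude,
            "longitude": longitude,
        })
    return result


groups = group_by_postcode(rows, coordinates)
for group in groups:
    print(group)
```

Each resulting record carries one latitude/longitude pair, which is what a per-location venue query needs.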

After initializing my connection to the Foursquare API, I configured the parameters defining the maximum number of venues to return for each query and the radius around each latitude/longitude pair in which venues should be located. I eventually settled on a venue maximum of 1000 and a radius of 1000 meters. Foursquare returns its results as a JSON object, from which I extracted the category of each venue. Once I had the venues for each neighborhood group, I used one-hot encoding to analyze the categories present in each group. I then filtered the categories to retain only those indicating a type of restaurant and summed the total number of restaurants associated with each group of neighborhoods.
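As a rough sketch of the category extraction and restaurant count, the snippet below works on a hand-written sample payload. Several things here are assumptions: the nesting `response -> groups -> items -> venue -> categories` is modeled on the response layout used in the earlier course labs, the rule "category name contains the word restaurant" is only one plausible way to do the filtering (the exact rule is not stated above), and a `Counter` is swapped in for the one-hot encoding followed by column sums, which yields the same per-category totals.

```python
from collections import Counter

# Hypothetical sample payload; the nesting is an assumption modeled on the
# response layout used in the course labs, not a verified API contract.
sample_results = {
    "response": {
        "groups": [
            {
                "items": [
                    {"venue": {"name": "A", "categories": [{"name": "Italian Restaurant"}]}},
                    {"venue": {"name": "B", "categories": [{"name": "Coffee Shop"}]}},
                    {"venue": {"name": "C", "categories": [{"name": "Restaurant"}]}},
                    {"venue": {"name": "D", "categories": [{"name": "Park"}]}},
                ]
            }
        ]
    }
}


def venue_categories(results):
    """Return the primary category name of each venue in the payload."""
    items = results["response"]["groups"][0]["items"]
    return [
        item["venue"]["categories"][0]["name"]
        for item in items
        if item["venue"]["categories"]  # skip venues without a category
    ]


def count_restaurants(categories):
    """Count venues whose category name mentions 'restaurant' (assumed rule)."""
    counts = Counter(categories)  # stand-in for one-hot encoding + column sums
    return sum(n for name, n in counts.items() if "restaurant" in name.lower())


categories = venue_categories(sample_results)
restaurant_total = count_restaurants(categories)
print(categories, restaurant_total)
```

A name-based filter like this would miss restaurant-like categories that do not contain the word, such as pizzerias or diners, so the choice of rule affects the totals.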

Finally, to obtain the restaurant density for each group, I divided the total number of restaurants by the area over which the Foursquare API searched for venues. That area is a circle of radius 1000 m, so it equals π * (1000 m)^2 = π * 10^6 m^2 = π km^2. The resulting densities are numerically small because they are expressed in restaurants per square meter, but the ranking of the neighborhoods by density does not depend on the units used.
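The density calculation can be sketched as below. The function name `restaurant_density`, the group labels, and the counts are hypothetical; only the 1000 m radius and the circular search area come from the description above.

```python
import math

SEARCH_RADIUS_M = 1000  # search radius used for each query, in meters


def restaurant_density(restaurant_count, radius_m=SEARCH_RADIUS_M):
    """Restaurants per square meter within a circular search area."""
    area_m2 = math.pi * radius_m ** 2
    return restaurant_count / area_m2


# Hypothetical restaurant counts per neighborhood group.
counts = {"Group A": 1, "Group B": 100, "Group C": 12}

densities = {name: restaurant_density(n) for name, n in counts.items()}

# Rank groups from lowest to highest density (least to most competition).
ranked = sorted(densities, key=densities.get)

for name in ranked:
    print(f"{name}: {densities[name]:.6e} restaurants per m^2")
```

Because every group uses the same search area, ranking by density is equivalent to ranking by raw restaurant count; the division only fixes the units.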

---------------------------------------------------------

<a id='item4'></a>
## Results
_Discussion of the results._

The results of my analysis revealed that the following neighborhoods all have the minimum restaurant density:

* Port Union, Rouge Hill, Highland Creek
* Malvern, Rouge
* West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
* Weston
* Scarborough Village
* Humberlea, Emery
* York Mills, Silver Hills

while the following neighborhoods all share the maximum restaurant density:

* Toronto Dominion Centre, Design Exchange
* St. James Town
* Victoria Hotel, Commerce Court
* Garden District, Ryerson
* Union Station, Toronto Islands, Harbourfront East
* Underground city, First Canadian Place
* Richmond, King, Adelaide

Having so many neighborhoods tied at both density extrema is an artifact of the discrete venue counts returned by the Foursquare API. Specifically, the minimum value of 3.183099e-07 restaurants per m^2 corresponds to a single restaurant in the search area, while the maximum of 3.183099e-05 restaurants per m^2 corresponds to 100 restaurants in the search area. The minimum seems sensible, as there is almost always a restaurant within a kilometer of any given location in a major city. The repetition of the maximum value is curious, however: nothing about urban environments suggests a natural ceiling of exactly 100 restaurants per search area. I suspect that it reflects some sort of internal limitation of my free Foursquare developer account.

---------------------------------------------------------

<a id='item5'></a>
## Discussion
_Discussion of any observations and recommendations based on the results._

The list of 

---------------------------------------------------------

<a id='item6'></a>
## Conclusion
_Conclude the report._

---------------------------------------------------------