# Applied Data Science Capstone
## Capstone notebook project &#x2001; by &#x2001; *Alejandro Casterá García*
### Week 05 - Assignment 01:
#### The Battle of Neighborhoods

## Table of contents  <a name="toc_hd5we7"></a>
* [Introduction](#introduction_hd5we7)
* [Data](#data_hd5we7)
* [Methodology](#methodology_hd5we7)
* [Results](#results_hd5we7)
* [Discussion](#discussion_hd5we7)
* [Conclusion](#conclusion_hd5we7)

<a name="introduction_hd5we7"></a><br><br><br><br><br><br>

---

## Introduction &#x2001;&#x2001;&#x2001;&#x2001;<a href="#toc_hd5we7">&#x21A5;</a>&#x2001;&#x2001;

* The client is an editorial company which already publishes several magazines and newspapers.
* They want to launch a new product: a short newspaper to be distributed free of charge at central places in the city of **Bogotá, Colombia**.
* The revenues of this newspaper will **ONLY be based on advertising**.
* The target audience consists of highly-educated working professionals whom advertisers may be interested in.
* The problem the client faces is that they don't know __which locations in the city would be best suited for distributing the newspaper to pedestrians__.

<a name="data_hd5we7"></a><br><br><br><br><br><br>

---

## Data &#x2001;&#x2001;&#x2001;&#x2001;<a href="#toc_hd5we7">&#x21A5;</a>&#x2001;&#x2001;

* The source of data is the **FourSquare API**.
* This source gives us **categorized venues** of all kinds around the coordinates of our choice.
* The idea is to build a **grid** of equally sized blocks in the city and get the venues for each block.
* Then we will keep ONLY those venues of **certain categories**.
* Examples of the categories we will be using are:
> &bull; *Office*  
  &bull; *Business center*  
  &bull; *Convention center*  
  &bull; *Fair*  
  &bull; *Government building*  
  &bull; *etc*  
* The **coordinates** of those venues found will be the FEATURES for our Machine Learning model.
* Our Machine Learning model will be in charge of identifying **clusters** of those categorized venues within the city.
* Clusters will take into account the **DENSITY** of venues with business-like categories since we want to distribute the highest number of newspapers per location.
* These clusters will give the client the list of best locations in the city to start **distributing the newspaper**.

<a name="methodology_hd5we7"></a><br><br><br><br><br><br>

---

## Methodology &#x2001;&#x2001;&#x2001;&#x2001;<a href="#toc_hd5we7">&#x21A5;</a>&#x2001;&#x2001;

* As we have mentioned already, we are going to use the **FourSquare API**.
* When we use this API we need to provide a point on the map and a search radius, so the API returns results (venues) **inside a circle**.
* We need to search venues in the metropolitan area of the city of **Bogotá, Colombia**.
* So the first thing we need to create is a **polygon** that encloses the whole area to cover, and then divide that area into **equally-sized cells**.
* Normally these cells would be rectangular, but since FourSquare works with circles, we need cells that are as close to circular as possible.
* We decided to use the **QGIS** desktop app to generate the grid.
* QGIS does NOT support circular cells, but it supports **hexagonal** cells, which are the closest available shape to a circle.

* In the following images you can see the result (area to cover, generated grid):

<p float="left" align="middle">
  <img src="https://raw.githubusercontent.com/AleCaste/coursera-applied_data_science_capstone/master/week05_01_ex/001_bogota_grid_off.jpg" width="400" />
  <img src="https://raw.githubusercontent.com/AleCaste/coursera-applied_data_science_capstone/master/week05_01_ex/002_bogota_grid_on.jpg" width="400" /> 
</p>

* QGIS allowed us to calculate the **centroids** of each hexagonal cell.
* The **geographical coordinates** of all the centroids were then exported to a *CSV* file so that we could later use them to call the FourSquare API.
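
For readers without QGIS, a comparable set of centroids can be approximated in plain Python. This is only a sketch under stated assumptions: it works in a flat, metric coordinate system rather than latitude/longitude, the function name `hex_centroids` and the bounding box are hypothetical, and it is not the algorithm QGIS uses. It relies on the fact that in a regular hexagonal lattice every cell centre is one spacing `d` away from its nearest neighbours.

```python
import math

def hex_centroids(x_min, y_min, x_max, y_max, d):
    """Return (x, y) centres of a hexagonal lattice covering a bounding box.

    d is the distance between neighbouring centres (twice the inradius).
    Coordinates are planar metres, not latitude/longitude.
    """
    row_step = d * math.sqrt(3) / 2      # vertical distance between rows
    centroids = []
    row = 0
    y = y_min
    while y <= y_max:
        # Shift every other row by half a cell so the lattice is hexagonal
        x = x_min + (d / 2 if row % 2 else 0)
        while x <= x_max:
            centroids.append((x, y))
            x += d
        y += row_step
        row += 1
    return centroids

# Hypothetical 10 km x 10 km box with a 2 km spacing between centres
cells = hex_centroids(0, 0, 10_000, 10_000, d=2000)
print(len(cells), "centroids")
```

Real coordinates would still need to be projected to and from a metric coordinate system, which is one of the things QGIS handled for us.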

* One problem we faced was that the **horizontal spacing** we set between cells was **2 km**, but this distance turned out NOT to be the diameter of the hexagonal cell (that is, the diameter of the circle that encloses the hexagon) but the diameter of the inscribed circle (2*r in the figure below):

<img src="https://raw.githubusercontent.com/AleCaste/coursera-applied_data_science_capstone/master/week05_01_ex/003_hexagon_parts.png" width="400" />

* But knowing d (and therefore $r = d/2$), we could calculate the circumradius as $R = 2r/\sqrt{3}$:

```python
import math
# See: http://www.drking.org.uk/hexagons/misc/ratio.html
d = 2000             # This is the horizontal spacing parameter used in QGIS when generating the hexagonal grid
r = d/2              # Inradius of the hexagon
R = r*2 / 3**(1/2)   # Circumradius of the hexagon
D = R*2              # Diameter of the hexagon, which equals the actual distance between cells
R = math.ceil(R)     # We want this number rounded-up
print('\nGeographic grid:')
print('R            : '+str(R))
```

```

Geographic grid:
R            : 1155
```


* ... And that is the radius (R, rounded up to 1155 m) we used for our circular cells to cover the whole city.
* After importing all centroids from the CSV (generated by QGIS) and displaying them on a map as circles, we obtained the following:

<p float="left" align="middle">
  <img src="https://raw.githubusercontent.com/AleCaste/coursera-applied_data_science_capstone/master/week05_01_ex/004_bogota_grid_circles01.png" width="374" />
  <img src="https://raw.githubusercontent.com/AleCaste/coursera-applied_data_science_capstone/master/week05_01_ex/005_bogota_grid_circles02.png" width="445" /> 
</p>

* As you can see from the zoomed-in image on the right, the circles **overlap** each other a little bit.
* This **cannot be avoided**: circles cannot tile a plane without gaps, so covering the whole area requires some overlap.
* So we need to keep in mind that some results that we might get from FourSquare could be **duplicated**.
* But since all results do have their own unique **id** it will be easy to remove duplicated venues.
* Again, this is something to keep in mind.
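
Removing duplicates by `id`, as described above, can be sketched with nothing more than a dictionary. The records below are made up for illustration, and the helper name `dedupe_by_id` is hypothetical; the only assumption about the real data is that each venue carries an `id` field.

```python
def dedupe_by_id(venues):
    """Keep the first occurrence of each venue id, preserving order."""
    unique = {}
    for venue in venues:
        unique.setdefault(venue["id"], venue)
    return list(unique.values())

# Made-up records: venue "b" is returned by two overlapping circles
raw = [
    {"id": "a", "name": "Venue A"},
    {"id": "b", "name": "Venue B"},
    {"id": "b", "name": "Venue B"},
    {"id": "c", "name": "Venue C"},
]
venues = dedupe_by_id(raw)
print(len(raw), "->", len(venues))
```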

* Ok... time to call the FourSquare API.
* To do so we will use the **search** endpoint (we do not want top or trending venues BUT all venues around a point).
* Since the client wants to target *highly-educated working professionals the advertisers can be interested in*, we need to locate ONLY venues whose categories can be associated with this target.
* The whole list of categories that were used is (the strings on the left are the ids):

```python
# Foursquare category ids used in the search (the variable name is illustrative)
category_ids = [
    '4bf58dd8d48988d124941735',  # office
    '56aa371be4b08b9a8d573517',  # business center
    '4bf58dd8d48988d1ff931735',  # convention center
    '4eb1daf44b900d56c88a4600',  # fair
    '52f2ab2ebcbc57f1066b8b56',  # atm
    '4bf58dd8d48988d10a951735',  # bank
    '5453de49498eade8af355881',  # business service
    '5032850891d4c4b30a586d62',  # credit union
    '5744ccdfe4b0c0459246b4be',  # currency exchange
    '503287a291d4c4b30a586d65',  # financial or legal service
    '5ae95d208a6f17002ce792b2',  # notary
    '4bf58dd8d48988d130941735',  # building
    '4bf58dd8d48988d126941735',  # government building
    '52e81612bcbc57f1066b7a32',  # cultural center
    '58daa1558bbb0b01f18ec1b2',  # research station
]
```
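
A request for one circular cell might be assembled as in the sketch below. It assumes the legacy Foursquare v2 `venues/search` endpoint and its usual query parameters (`ll`, `radius`, `categoryId`, `intent`, `limit`, `v`). The helper name `build_search_url`, the credentials, the version date and the sample coordinates are placeholders, not the values used in this project. The code only builds the URL and makes no network call.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.foursquare.com/v2/venues/search"  # legacy v2 endpoint (assumed)

def build_search_url(lat, lng, radius_m, category_ids, client_id, client_secret,
                     version="20180605", limit=50):
    """Build the search request for one circular cell of the grid."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,                          # API version date, YYYYMMDD
        "ll": f"{lat},{lng}",                  # centre of the circle
        "radius": radius_m,                    # search radius in metres
        "intent": "browse",                    # search the whole circle
        "categoryId": ",".join(category_ids),  # comma-separated category ids
        "limit": limit,
    }
    return f"{SEARCH_URL}?{urlencode(params)}"

url = build_search_url(
    4.65, -74.08, 1155,                        # illustrative point, radius R in metres
    ["4bf58dd8d48988d124941735", "4bf58dd8d48988d10a951735"],  # office, bank
    client_id="YOUR_CLIENT_ID", client_secret="YOUR_CLIENT_SECRET",
)
print(url)
```

Because each request returns a capped number of venues, very dense cells may be truncated; that is worth checking before trusting the counts.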

* We ran a search query on ALL cells in our grid.
* In the process we created a table where all results were merged and cleaned up.
* One of the things we did was to make sure no duplicate venues were found in the table.
* We got 3320 venues in total out of which **2776** were unique (not duplicated).
* Some of the rows in this table can be seen below:

<img src="https://raw.githubusercontent.com/AleCaste/coursera-applied_data_science_capstone/master/week05_01_ex/006_bogota_venues_found_by_foursquare.png" width="800" />
<br>

* But... how many of these venues belonged to each of the categories selected?
* Out of the results obtained, we came up with the following table that answers this question (we are showing only the top results in the table):


<img src="https://raw.githubusercontent.com/AleCaste/coursera-applied_data_science_capstone/master/week05_01_ex/007_bogota_venues_found_by_foursquare__categories.png" width="300" />
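
A per-category tally like the one in the table can be produced with `collections.Counter`. The labels below are made up for illustration and do not reproduce the actual counts.

```python
from collections import Counter

# Made-up category labels, one per unique venue
venue_categories = ["Office", "Bank", "Office", "Building", "Office", "Bank"]

category_counts = Counter(venue_categories)
for category, n in category_counts.most_common(2):   # top 2 categories only
    print(f"{category:<10} {n}")
```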


* And where are all those venues located on the map?
* We used **Folium FastMarkerCluster plugin** for this task since the number of markers was simply too high.
* This plugin generated the following map:

<p float="left" align="middle">
  <img src="https://raw.githubusercontent.com/AleCaste/coursera-applied_data_science_capstone/master/week05_01_ex/008_bogota_grid_venue_markerclusters01.png" width="379" />
  <img src="https://raw.githubusercontent.com/AleCaste/coursera-applied_data_science_capstone/master/week05_01_ex/009_bogota_grid_venue_markerclusters02.png" width="440" /> 
</p>

* We could see that venues were all over the place throughout the city.
* The above map did not help us to get a good picture of which areas could have higher densities of these types of venues.
* So we decided to make use of the **Folium HeatMap plugin**.
* Thanks to this approach, we could generate the following map:

<img src="https://raw.githubusercontent.com/AleCaste/coursera-applied_data_science_capstone/master/week05_01_ex/010_bogota_grid_venue_heatmap.png" width="800" />


* The above is a much better map as we can easily identify which areas have higher densities.
* So now we need to select those areas, not visually but mathematically.
* Since **DENSITY** is what we are interested in, and we also want to avoid all the **NOISE** from outliers (venues that are not close to other venues), we decided to use the **DBSCAN** model.
* DBSCAN finds clusters for us, but we can NOT specify the number of clusters to be returned.
* The number of clusters returned depends on the way we ask DBSCAN to locate high-density areas.
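* There are 2 parameters to play with: `eps` (the maximum distance at which two points are still considered neighbours) and `min_samples` (the minimum number of neighbours a point needs to be treated as part of a dense area).
* We kept `min_samples` constant (20) and changed `eps` from 0.04 to 0.20.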
* The different **number of clusters** that we obtained from DBSCAN on each of these runs can be shown in this graph:

<img src="https://raw.githubusercontent.com/AleCaste/coursera-applied_data_science_capstone/master/week05_01_ex/011_bogota_venue_clusters_found_by_dbscan_per_eps.png" width="400" />
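
A parameter sweep of this kind can be sketched as follows, assuming scikit-learn's `DBSCAN` (whose `eps` and `min_samples` parameters match the names used above). The points are synthetic stand-ins for venue coordinates, the helper name `clusters_per_eps` is hypothetical, and the default Euclidean metric is used; the metric and units behind the `eps` values in this report are not restated here.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def clusters_per_eps(points, eps_values, min_samples):
    """Return {eps: number of clusters}, ignoring DBSCAN's noise label (-1)."""
    result = {}
    for eps in eps_values:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit(points).labels_
        result[eps] = len(set(labels) - {-1})
    return result

# Synthetic data: three dense blobs plus uniformly scattered noise
rng = np.random.default_rng(0)
blobs = [rng.normal(loc=c, scale=0.02, size=(60, 2)) for c in [(0, 0), (1, 0), (0, 1)]]
noise = rng.uniform(-0.5, 1.5, size=(30, 2))
points = np.vstack(blobs + [noise])

eps_values = [round(0.04 + 0.01 * i, 2) for i in range(17)]   # 0.04 ... 0.20
sweep = clusters_per_eps(points, eps_values, min_samples=20)
for eps, n in sweep.items():
    print(f"eps={eps:.2f} -> {n} clusters")
```

Plotting `sweep` gives the same kind of curve as the graph above, from which a value of `eps` can be read off for a desired number of clusters.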


* The client was looking for a moderate number of ideal distribution locations.
* After talking to them, they told us that around **20** locations would be a good starting point.
* So we picked `eps=0.09` and `min_samples=20`, which gave us **20 clusters**.
* The clusters provided by DBSCAN are shown in the following map:
  * **NOTE** *The image on the right shows a zoomed-in version with the heatmap in the background. You can see that the clusters are indeed located in those hotter areas.*

<p float="left" align="middle">
  <img src="https://raw.githubusercontent.com/AleCaste/coursera-applied_data_science_capstone/master/week05_01_ex/012_bogota_venue_clusters_found_by_dbscan.png" width="253" />
  <img src="https://raw.githubusercontent.com/AleCaste/coursera-applied_data_science_capstone/master/week05_01_ex/013_bogota_venue_clusters_found_by_dbscan.png" width="566" /> 
</p>

* This was our **initial solution**.
* The problem with it is that as you can see from the heatmap, there is *heat* all over the place.
* In other words... there is a **lot of noise**; the venues of the specified categories are everywhere in the city.
* So in order to come up with better recommendations, we decided to **reduce the noise**.


* The most trivial approach was to **remove some of the categories** so that FourSquare would return fewer venues.
* We decided to remove the **"Building"** category since it was one of the most frequent categories on our list of results, and the category itself is too vague.
* After doing so, the number of venues returned by FourSquare was reduced from 2776 down to just 607.
* Again, we used the Folium FastMarkerCluster plugin to locate the venues on the map:

<p float="left" align="middle">
  <img src="https://raw.githubusercontent.com/AleCaste/coursera-applied_data_science_capstone/master/week05_01_ex/014_bogota_grid_venue_markerclusters01.png" width="340" />&#x2001;<img src="https://raw.githubusercontent.com/AleCaste/coursera-applied_data_science_capstone/master/week05_01_ex/015_bogota_grid_venue_markerclusters02.png" width="479" /> 
</p>


* And we also generated a heatmap to visualize high-density areas (more venues of interest):

<img src="https://raw.githubusercontent.com/AleCaste/coursera-applied_data_science_capstone/master/week05_01_ex/016_bogota_grid_venue_heatmap.png" width="800" />


* The next step was to run DBSCAN again with `min_samples` constant (20) and `eps` values from 0.04 to 0.20.
* The problem we found was that, because the venues were now farther apart, the `min_samples` condition was difficult to meet, and DBSCAN returned very **low numbers of clusters** (from 0 to 5).
* Since we needed a higher number of clusters we **decreased min_samples down to 5**.
* After we did that, DBSCAN started to discover more clusters:

<img src="https://raw.githubusercontent.com/AleCaste/coursera-applied_data_science_capstone/master/week05_01_ex/017_bogota_venue_clusters_found_by_dbscan_per_eps.png" width="400" />


* So we picked `eps=0.09` and `min_samples=5`, which gave us **20 clusters**.
* The clusters provided by DBSCAN are shown in the following map:
  * **NOTE** *The images show the heatmap in the background. You can see that the clusters are indeed located in those hotter areas.*

<p float="left" align="middle">
  <img src="https://raw.githubusercontent.com/AleCaste/coursera-applied_data_science_capstone/master/week05_01_ex/021_bogota_venue_clusters_found_by_dbscan.png" width="341" />     
  <img src="https://raw.githubusercontent.com/AleCaste/coursera-applied_data_science_capstone/master/week05_01_ex/020_bogota_venue_clusters_found_by_dbscan.png" width="478" /> 
</p>    
<p float="left" align="middle">
  <img src="https://raw.githubusercontent.com/AleCaste/coursera-applied_data_science_capstone/master/week05_01_ex/019_bogota_venue_clusters_found_by_dbscan.png" width="820" />    
</p>

* After checking these maps we were satisfied with the recommended locations.
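
One simple way to turn each cluster into a single point on the map is to average the coordinates of its venues. This is an assumption made for illustration; other choices, such as the densest venue in the cluster, are equally possible. The helper name `cluster_centres`, the coordinates and the labels below are made up, with `-1` marking noise as DBSCAN does.

```python
from collections import defaultdict

def cluster_centres(points, labels):
    """Average the coordinates of each cluster; label -1 (noise) is skipped."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        if label != -1:
            groups[label].append(point)
    return {
        label: (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))
        for label, pts in groups.items()
    }

# Made-up (lat, lng) pairs and cluster labels
points = [(4.60, -74.08), (4.62, -74.06), (4.70, -74.04), (4.72, -74.04), (4.90, -74.20)]
labels = [0, 0, 1, 1, -1]
centres = cluster_centres(points, labels)
print(centres)
```

Averaging latitude and longitude directly is a reasonable approximation only because each cluster spans a small area.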

<a name="results_hd5we7"></a><br><br><br><br><br><br>

---

## Results &#x2001;&#x2001;&#x2001;&#x2001;<a href="#toc_hd5we7">&#x21A5;</a>&#x2001;&#x2001;

* Our first approach to get a list of recommended distribution locations faced the problem of **not being specific enough**.
* We were dealing with too many **points of interest** due to the fact that our definition of which locations could be *interesting* was too broad.
* This meant dealing with a lot of noise, so the resulting list of specific recommended locations was **not trustworthy enough**.
* We determined those locations by finding the areas with a **high density of points of interest** (matching the target profile the client was looking for).


* Our second approach turned out to be much better since we managed to **reduce the number of points of interest**.
* We did so by re-defining which locations could be *interesting*, being much more **specific** in this case.
* This resulted in much less noise and thus a much more reliable list of recommended locations to distribute the newspaper.
* In the end we came up with a list of **20 locations** which is exactly the number the client asked for.

<a name="discussion_hd5we7"></a><br><br><br><br><br><br>

---

## Discussion &#x2001;&#x2001;&#x2001;&#x2001;<a href="#toc_hd5we7">&#x21A5;</a>&#x2001;&#x2001;

* The area covered is around **30 $km^2$**, which is a bit too large.
* Since the recommended locations are spread over the whole area rather than concentrated in one part of the city, it may be difficult for the client to put all those distribution locations to work. The **logistics could be a bit complicated**.
* So we would recommend that the client make the **area of interest smaller**, focusing on, say, **half** of it.
* Determining where in the city to place this half would require further study.
* Once that is determined, we could perform the calculations again to get a new curated list of recommended distribution locations within the smaller area of choice. That would make logistics easier.

<a name="conclusion_hd5we7"></a><br><br><br><br><br><br>

---

## Conclusion &#x2001;&#x2001;&#x2001;&#x2001;<a href="#toc_hd5we7">&#x21A5;</a>&#x2001;&#x2001;

* This report is meant as the **starting point** to help the client decide which would be *appropriate locations in Bogotá to start distributing the free newspaper* to their target readers (highly-educated working professionals the advertisers can be interested in).
* The list of recommended locations does NOT include **specific addresses** but **zones** within the city. It will be up to the client to determine the exact address inside each zone or in the near surroundings depending on the specifics of the logistics and some other factors like proximity to well-known buildings, public transportation accesses etc (some of these conditions could be incorporated in further calculations should the client ask for it).

<br><br><br>