# Capstone Project - The Battle of the Neighborhoods
### Applied Data Science Capstone by IBM/Coursera
Much of the code in this project is derived from the class lectures and the sample Capstone Project.

## Table of contents
*  [Introduction: Business Problem](#introduction)
*  [Data](#data)
*  [Methodology](#methodology)
*  [Analysis](#analysis)
*  [Results and Discussion](#results)
*  [Conclusion](#conclusion)

## Introduction: <a name="introduction"></a>

#### Background & Business Problem

Amazon announced that it will open its second headquarters (HQ2) in **National Landing**, a future neighborhood that includes **Crystal City** in **Arlington, Virginia**, bringing upwards of 25,000 workers. In addition, two to three times that number of people may support Amazon HQ2 indirectly. For a potential entrepreneur, this is a great opportunity to open a new business catering to the influx of new employees.

In this project, we will try to find an optimal restaurant type to open for the new personnel expected to move into the area around Amazon HQ2.

## Data<a name="data"></a>

Based on the definition of our problem, factors that will influence our decision are:
- Number of existing restaurant types in the neighborhood.
- Distance of neighborhood from the Amazon HQ2 center.

The following data sources will be used to extract or generate the required information:
- Centers of candidate areas will be generated algorithmically as a grid around Amazon HQ2, using **pyproj** to convert between latitude/longitude and UTM coordinates.
- The number, types, and locations of restaurants in each neighborhood will be obtained using the **Foursquare API**.
- The coordinates of Amazon HQ2 will be obtained by geocoding **National Landing** with the **geopy** Nominatim geocoder.

## Methodology<a name="methodology"></a>

This project will look for a potential restaurant location within 1,500 meters (a comfortable walking distance for a lunch outing) of the proposed Amazon HQ2 in **National Landing**, which includes **Crystal City, Virginia**. <br>
<br>
The optimal restaurant type will be defined as the most frequent type found between 1,500 and 6,000 meters from Amazon HQ2 that is not present within 1,500 meters. <br>
<br>
The area within 6,000 meters of the proposed Amazon HQ2 will be segmented into neighborhoods, each with a 300 meter radius. <br>
<br>
Once the optimal restaurant type is identified, the optimal location will be the neighborhood within 1,500 meters that has the fewest restaurants.
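
The selection rule for the restaurant type can be sketched in a few lines of plain Python. This is only an illustration of the rule as stated: the category lists are made-up placeholders rather than data from this study, and `pick_candidate_types` is a hypothetical helper name.

```python
from collections import Counter

def pick_candidate_types(far_categories, near_categories, top_n=5):
    """Rank categories seen in the outer ring that never appear in the inner circle."""
    near_set = set(near_categories)
    far_counts = Counter(c for c in far_categories if c not in near_set)
    return far_counts.most_common(top_n)

# Hypothetical example data (not taken from the Foursquare results)
far = ['French', 'Tacos', 'French', 'Coffee Shop', 'Korean', 'Tacos', 'French']
near = ['Coffee Shop', 'American']

candidates = pick_candidate_types(far, near)
print(candidates)
```

Here `Coffee Shop` is dropped because it already exists in the inner circle, and the remaining categories are ranked by how often they occur farther out.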

### Data Build

#### Importing libraries

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis

import json # Library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests

import folium # map rendering library
import shapely.geometry
import pyproj
import math
import pickle

#### Use the geopy library to get the latitude and longitude values of National Landing.
In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent nat_explorer, as shown below.

In [2]:
address = 'National Landing, Virginia'
geolocator = Nominatim(user_agent='nat_explorer')
location = geolocator.geocode(address)
nl_latitude = location.latitude
nl_longitude = location.longitude
print('The geographical coordinates of National Landing are {}, {}.'.format(nl_latitude, nl_longitude))

The geographical coordinates of National Landing are 38.8548783, -77.0517428.
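
`geolocator.geocode` returns `None` when it cannot resolve an address, in which case reading `.latitude` would fail with an `AttributeError`. A minimal guard might look like the sketch below; `extract_coordinates` is a hypothetical helper, and the `SimpleNamespace` object only stands in for a real geocoder result so the example runs without a network call.

```python
from types import SimpleNamespace

def extract_coordinates(location, address):
    """Return (latitude, longitude), or raise a clear error when geocoding found nothing."""
    if location is None:
        raise ValueError(f'No geocoding result for {address!r}')
    return location.latitude, location.longitude

# Stand-in for a geocoder result, using the coordinates printed above
fake_location = SimpleNamespace(latitude=38.8548783, longitude=-77.0517428)
coords = extract_coordinates(fake_location, 'National Landing, Virginia')
print(coords)
```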


Now let's create a grid of equally spaced area candidates, centered on Amazon HQ2 and within ~6,000 meters of it. Our neighborhoods will be defined as circular areas with a radius of 300 meters, so our neighborhood centers will be 600 meters apart.

To calculate distances accurately, we need to create our grid of locations in a 2D Cartesian coordinate system, which lets us work in meters rather than in latitude/longitude degrees. We'll then project those coordinates back to latitude/longitude degrees so they can be shown on a Folium map. So let's create functions to convert between the WGS84 spherical coordinate system (latitude/longitude degrees) and the UTM Cartesian coordinate system (X/Y coordinates in meters).
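
To see why degrees are a poor unit for distance, the sketch below estimates how many meters one degree of latitude and one degree of longitude cover at the latitude of National Landing. It assumes a spherical Earth with a mean radius of 6,371 km, which is an approximation used only for illustration and is not the UTM projection used in this project.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius; a spherical approximation

def meters_per_degree(lat_deg):
    """Approximate ground distance covered by one degree of latitude and of longitude."""
    m_per_deg_lat = math.pi * EARTH_RADIUS_M / 180.0
    # Meridians converge toward the poles, so a degree of longitude shrinks with cos(latitude)
    m_per_deg_lon = m_per_deg_lat * math.cos(math.radians(lat_deg))
    return m_per_deg_lat, m_per_deg_lon

lat_m, lon_m = meters_per_degree(38.8548783)
print(round(lat_m), round(lon_m))
```

Because a degree of longitude is noticeably shorter than a degree of latitude at this location, a grid that is evenly spaced in degrees would not be evenly spaced in meters.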

We define functions to convert latitude and longitude to an x, y position (and back) and to calculate distances.

In [37]:
def latlon_to_xy(lat, lon):
    project_latlon = pyproj.Proj(proj = 'latlong', datum = 'WGS84')
    project_xy = pyproj.Proj(proj = 'utm', zone = 18, datum = 'WGS84')
    xy = pyproj.transform(project_latlon, project_xy, lon, lat)
    return xy[0], xy[1]
    
def xy_to_latlon(x, y):
    # Note: despite the name, this returns (longitude, latitude), in that order
    project_latlon = pyproj.Proj(proj = 'latlong', datum = 'WGS84')
    project_xy = pyproj.Proj(proj = 'utm', zone = 18, datum = 'WGS84')
    latlon = pyproj.transform(project_xy, project_latlon, x, y)
    return latlon[0], latlon[1]

def cal_xy_dist(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx**2 + dy**2)

print('Coordinate transformation check')
print('------------------------------------------------------------------------------')
print(f'National Landing Center latitude = {nl_latitude}, longitude = {nl_longitude}')
nl_x , nl_y = latlon_to_xy(nl_latitude, nl_longitude)
print(f'National Landing Center UTM x = {nl_x}, y = {nl_y}')
lon, lat = xy_to_latlon(nl_x, nl_y)
print(f'National Landing Center latitude = {lat}, longitude = {lon}')
if (round(nl_latitude, 7) == round(lat, 7)) and (round(nl_longitude, 7) == round(lon, 7)):
    print("Coordinate transformation checks")
else:
    print("Coordinate did not transform properly")

Coordinate transformation check
------------------------------------------------------------------------------
National Landing Center latitude = 38.8548783, longitude = -77.0517428
National Landing Center UTM x = 321965.45553396875, y = 4302672.739967278
National Landing Center latitude = 38.85487830000001, longitude = -77.0517428
Coordinate transformation checks
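
The round trip above recovers the latitude only up to floating-point noise (`38.85487830000001` versus `38.8548783`). Comparing rounded values works here, but a tolerance-based comparison is a slightly more robust alternative, because two nearly identical numbers can still round differently when they sit on a rounding boundary. A minimal sketch, where `roundtrip_ok` is a hypothetical helper and the tolerance of `1e-7` degrees is an assumption:

```python
import math

def roundtrip_ok(original, recovered, tol=1e-7):
    """True when every recovered value is within tol of the corresponding original."""
    return all(math.isclose(a, b, abs_tol=tol) for a, b in zip(original, recovered))

# Values copied from the printed check above
original = (38.8548783, -77.0517428)
recovered = (38.85487830000001, -77.0517428)
check = roundtrip_ok(original, recovered)
print(check)
```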


Let's create a **hexagonal grid of cells**: we offset every other row and adjust the vertical row spacing so that **every cell center is equally distant from all its neighbors**.

In [5]:
k = math.sqrt(3) / 2
x_min = nl_x - 6000
x_step = 600
y_min = nl_y - 6000 - (int(21/k)*k*600 - 12000)/2
y_step = 600 * k
latitudes = []
longitudes = []
dists_ctr = []
xs = []
ys = []

for i in range(0, int(21/k)):
    y = y_min + i * y_step
    x_offset = 300 if i%2==0 else 0
    for j in range(0, 21):
        x = x_min + j * x_step + x_offset
        dist_ctr = cal_xy_dist(nl_x, nl_y, x, y)
        if (dist_ctr <= 6001):
            lon, lat = xy_to_latlon(x, y)
            latitudes.append(lat)
            longitudes.append(lon)
            dists_ctr.append(dist_ctr)
            xs.append(x)
            ys.append(y)

print(f'{len(latitudes)}, neighborhood centers generated.')

364, neighborhood centers generated.
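
As a sanity check on the spacing used for the grid, the sketch below builds a tiny 3 x 3 hexagonal grid with the same horizontal step (600 meters), the same row spacing (600 x sqrt(3)/2), and the same half-step offset on even rows, then measures the distances from the middle cell to the other cells. `hex_centers` is a hypothetical helper written only for this check.

```python
import math

STEP = 600.0                          # spacing between centers within a row (meters)
ROW_STEP = STEP * math.sqrt(3) / 2    # vertical spacing between rows (meters)

def hex_centers(rows, cols):
    """Return (x, y) centers of a small hexagonal grid; even rows are shifted by half a step."""
    centers = []
    for i in range(rows):
        x_offset = STEP / 2 if i % 2 == 0 else 0.0
        for j in range(cols):
            centers.append((j * STEP + x_offset, i * ROW_STEP))
    return centers

centers = hex_centers(3, 3)
cx, cy = centers[4]  # middle cell of the middle row
neighbor_dists = sorted(math.hypot(x - cx, y - cy)
                        for x, y in centers if (x, y) != (cx, cy))
print([round(d, 1) for d in neighbor_dists])
```

The six smallest distances all come out at 600 meters, which is what makes the 300 meter circles touch their neighbors without overlapping.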


Let's create a list from 1 to 364 to identify the 364 neighborhood sections.

In [94]:
sections = list(range(1, 365))

In [116]:
df_neighborhood = pd.DataFrame(zip(sections, latitudes, longitudes),
                               columns = ['Section', 'Latitude', 'Longitude'])

Let's display the map. The *green* circle marks 1,500 meters from Amazon HQ2; the *red* circle marks 6,000 meters from Amazon HQ2.

In [95]:
map_national1 = folium.Map(location = [nl_latitude, nl_longitude], zoom_start = 12)

folium.Circle([nl_latitude, nl_longitude], radius = 1500, color = 'green', fill = False,
             ).add_to(map_national1)
folium.Circle([nl_latitude, nl_longitude], radius = 6000, color = 'red', fill = False,
             ).add_to(map_national1)


map_national1

We'll add *blue* circles depicting the 300 meter radius of each neighborhood. These will be the neighborhood candidates. The neighborhood number is shown in the *popup* and *tooltip* on the map.

In [96]:
map_NL = folium.Map(location = [nl_latitude, nl_longitude], zoom_start = 12)
for lat, lon, sec in zip(latitudes, longitudes, sections):
    folium.Circle([lat, lon], radius = 300, color = 'blue', fill = False,
                 popup = sec, tooltip = sec).add_to(map_NL)
folium.Circle([nl_latitude, nl_longitude], radius = 1500, color = 'green', fill = False,
             ).add_to(map_NL)
folium.Circle([nl_latitude, nl_longitude], radius = 6000, color = 'red', fill = False,
             ).add_to(map_NL)

map_NL

We can zoom into the green circle and identify all the neighborhoods inside it. They are listed below.

In [14]:
close_proximity = [240, 241, 242,220, 221, 222, 223, 200, 201, 202, 203, 204, 180, 181, 182, 183, 184, 185,
                   161, 162, 163, 164, 165, 142, 143, 144, 145, 123, 124, 125]
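
The list above was read off the map by hand. As a possible alternative, the near sections could be derived from the center distances computed while building the grid (`dists_ctr`, which is in the same order as `sections`). The sketch below shows the idea on made-up distances; `sections_within` is a hypothetical helper, and because it tests only the distance of each center, it may not reproduce the hand-picked list exactly.

```python
def sections_within(sections, distances, max_dist):
    """Return the section ids whose center lies within max_dist meters of HQ2."""
    return [sec for sec, dist in zip(sections, distances) if dist <= max_dist]

# Hypothetical distances in meters, for illustration only
example_sections = [1, 2, 3, 4]
example_dists = [5400.0, 1499.9, 300.0, 1500.1]
near_sections = sections_within(example_sections, example_dists, 1500)
print(near_sections)
```

With the real data, the call would be `sections_within(sections, dists_ctr, 1500)`.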

In [117]:
# Boolean mask marking the sections that lie inside the 1,500 meter circle
near_mask = df_neighborhood['Section'].isin(close_proximity)

In [118]:
# Near sections, kept in the order listed in close_proximity
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df_neigh_near = pd.concat([df_neighborhood[df_neighborhood.Section == section]
                           for section in close_proximity], ignore_index = True)
# Far sections: an explicit copy avoids pandas' SettingWithCopyWarning
df_neigh_far = df_neighborhood[~near_mask].copy()


#### Define Foursquare Credentials and Version

In [15]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (redacted)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (redacted)
VERSION = '20191225' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET:YOUR_FOURSQUARE_CLIENT_SECRET


#### Now, let's get up to 10,920 venues (up to 30 for each of the 364 neighborhoods) within 6,000 meters of Amazon HQ2
We limit each request to 30 venues because of the Foursquare Sandbox account limitation. According to the Foursquare documentation, the category ID for restaurants is 4d4b7105d754a06374d81259.

In [20]:
radius = 300
LIMIT = 30
category = '4d4b7105d754a06374d81259'
urls = []
for lat, lon in zip(latitudes, longitudes):
    url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&v=20191225&ll={},{}&radius={}&categoryId={}&limit={}'.format(
        CLIENT_ID, CLIENT_SECRET, lat, lon, radius, category, LIMIT)
    urls.append(url)
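
Long hand-written query strings are easy to get subtly wrong. One alternative, sketched here with only the standard library, is to build the query from a dictionary of named parameters. The endpoint and parameter names are the ones used in the cell above; `build_search_url` is a hypothetical helper, and the `'ID'` and `'SECRET'` values are placeholders rather than real credentials.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = 'https://api.foursquare.com/v2/venues/search'

def build_search_url(client_id, client_secret, version, lat, lon,
                     radius, category_id, limit):
    """Assemble the venue search URL from named parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': f'{lat},{lon}',
        'radius': radius,
        'categoryId': category_id,
        'limit': limit,
    }
    return f'{BASE_URL}?{urlencode(params)}'

# Placeholder credentials; substitute real ones before sending a request
example_url = build_search_url('ID', 'SECRET', '20191225', 38.8548783, -77.0517428,
                               300, '4d4b7105d754a06374d81259', 30)
# Parse the URL back to confirm every parameter was encoded as intended
query = parse_qs(urlsplit(example_url).query)
print(query['limit'], query['ll'])
```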

Send the GET request and examine the results

In [22]:
name = []
category = []
lat = []
lon = []
section = []
i = 1
for url in urls:
    results = requests.get(url).json()
    venues = results['response']['venues']
    for venue in venues:
        name.append(venue['name'])
        lat.append(venue['location']['lat'])
        lon.append(venue['location']['lng'])
        category.append(venue['categories'][0]['shortName'])
        section.append(i)
    i += 1
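
The loop above assumes that every venue has at least one category. If a venue came back with an empty `categories` list, `venue['categories'][0]` would raise an `IndexError`. Whether Foursquare actually returns such venues for this query is an assumption; the sketch below simply shows one defensive way to flatten a venue record. `venue_to_row` and the `'Unknown'` label are hypothetical, and the sample records are invented to match the fields used above.

```python
def venue_to_row(venue, section):
    """Flatten one venue dict into (name, category, lat, lon, section)."""
    categories = venue.get('categories') or []
    # Fall back to a placeholder label when no category is attached
    category = categories[0]['shortName'] if categories else 'Unknown'
    location = venue['location']
    return venue['name'], category, location['lat'], location['lng'], section

# Hypothetical venue records shaped like the fields used above
sample = [
    {'name': 'Cafe A', 'location': {'lat': 38.85, 'lng': -77.05},
     'categories': [{'shortName': 'Coffee Shop'}]},
    {'name': 'Stand B', 'location': {'lat': 38.86, 'lng': -77.04}, 'categories': []},
]
rows = [venue_to_row(v, section) for section, v in enumerate(sample, start=1)]
print(rows)
```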

Let's create a data frame of the results and export it (so that we don't have to rerun the Foursquare API calls).

In [23]:
df = pd.DataFrame(zip(name, category, lat, lon, section),
                 columns = ['Name', 'Category', 'Latitude', 'Longitude', 'Section'])

In [40]:
df.to_pickle('national_landing.pkl')

Let's separate the near venues (within 1,500 meters) from the far venues (between 1,500 and 6,000 meters).

In [30]:
# Boolean mask marking venues that belong to a section inside the 1,500 meter circle
venue_near_mask = df['Section'].isin(close_proximity)

In [31]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df_near = pd.concat([df[df.Section == section] for section in close_proximity],
                    ignore_index = True)
# Explicit copy avoids pandas' SettingWithCopyWarning
df_far = df[~venue_near_mask].copy()

## Analysis <a name="analysis"></a>

Let's take a look at initial results.

In [60]:
print(f'There are a total of {len(df)} restaurants within 6,000 meters of the proposed Amazon HQ2 location')
print(f'There are a total of {len(df_far)} restaurants between 1,500 and 6,000 meters of the proposed Amazon HQ2 location')
print(f'There are a total of {len(df_near)} restaurants within 1,500 meters of the proposed Amazon HQ2 location')

There are a total of 2996 restaurants within 6,000 meters of the proposed Amazon HQ2 location
There are a total of 2669 restaurants between 1,500 and 6,000 meters of the proposed Amazon HQ2 location
There are a total of 327 restaurants within 1,500 meters of the proposed Amazon HQ2 location


Let's take a look at the different types of restaurants within each of the two areas.

In [67]:
print(f"There are a total of {len(df['Category'].unique())} unique restaurant categories within 6,000 meters of the proposed Amazon HQ2 location.")
print(f"There are a total of {len(df_far['Category'].unique())} unique restaurant categories between 1,500 and 6,000 meters of the proposed Amazon HQ2 location.")
print(f"There are a total of {len(df_near['Category'].unique())} unique restaurant categories within 1,500 meters of the proposed Amazon HQ2 location.")

There are a total of 157 unique restaurant categories within 6,000 meters of the proposed Amazon HQ2 location.
There are a total of 154 unique restaurant categories between 1,500 and 6,000 meters of the proposed Amazon HQ2 location.
There are a total of 64 unique restaurant categories within 1,500 meters of the proposed Amazon HQ2 location.


Let's create data frames of the restaurant types and the number of each, for three areas: overall, between 1,500 and 6,000 meters, and within 1,500 meters of the proposed Amazon HQ2 location.

In [41]:
dict_rest_cat = {}           # Dictionary of all the restaurant categories
dict_rest_cat_far = {}       # Dictionary of restaurant categories between 1,500 and 6,000 meters
dict_rest_cat_near = {}      # Dictionary of restaurant categories within 1,500 meters

In [42]:
for rest_cat in df['Category']:
    if rest_cat in dict_rest_cat:
        dict_rest_cat[rest_cat] += 1
    else:
        dict_rest_cat[rest_cat] = 1

for rest_cat in df_far['Category']:
    if rest_cat in dict_rest_cat_far:
        dict_rest_cat_far[rest_cat] += 1
    else:
        dict_rest_cat_far[rest_cat] = 1
        
for rest_cat in df_near['Category']:
    if rest_cat in dict_rest_cat_near:
        dict_rest_cat_near[rest_cat] += 1
    else:
        dict_rest_cat_near[rest_cat] = 1
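
The three counting loops above follow the same pattern, which the standard library's `collections.Counter` expresses directly. The sketch below uses an invented list in place of `df['Category']`, so the numbers are illustrative only.

```python
from collections import Counter

# Hypothetical category values standing in for df['Category']
categories = ['Coffee Shop', 'American', 'Coffee Shop',
              'Food Truck', 'Coffee Shop', 'American']

counts = Counter(categories)      # maps each category to its number of occurrences
ranked = counts.most_common()     # list of (category, count), most frequent first
print(ranked)
```

A `Counter` is a dictionary subclass, so it can be used anywhere the hand-built dictionaries are used later in this notebook.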

In [51]:
# Build each count table directly from its dictionary
# (DataFrame.append was removed in pandas 2.0)
df_rest_cat = pd.DataFrame(list(dict_rest_cat.items()), columns = ['Category', 'Count'])
df_rest_cat.sort_values(['Count'], ascending = False, inplace = True)

In [52]:
df_rest_cat_far = pd.DataFrame(list(dict_rest_cat_far.items()), columns = ['Category', 'Count'])
df_rest_cat_far.sort_values(['Count'], ascending = False, inplace = True)

In [53]:
df_rest_cat_near = pd.DataFrame(list(dict_rest_cat_near.items()), columns = ['Category', 'Count'])
df_rest_cat_near.sort_values(['Count'], ascending = False, inplace = True)

Let's take a look at the most frequent types of restaurants between 1,500 and 6,000 meters from the proposed Amazon HQ2 location.

In [55]:
df_rest_cat_far.head()

|    | Category    | Count |
|----|-------------|-------|
| 11 | Food Truck  | 222   |
| 4  | Coffee Shop | 195   |
| 3  | American    | 184   |
| 8  | Sandwiches  | 140   |
| 23 | Café        | 137   |


Let's identify the restaurant types that are found between 1,500 and 6,000 meters from the Amazon HQ2 site but not within 1,500 meters.

In [56]:
dict_notin_near = {}
for category in dict_rest_cat_far:
    if category not in dict_rest_cat_near:
        dict_notin_near[category] = dict_rest_cat_far[category]

In [57]:
# Build the table directly from the dictionary (DataFrame.append was removed in pandas 2.0)
df_notin_near = pd.DataFrame(list(dict_notin_near.items()), columns = ['Category', 'Count'])
df_notin_near.sort_values(['Count'], ascending = False, inplace = True)

In [150]:
df_notin_near

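|    | Category           | Count |
|----|--------------------|-------|
| 9  | French             | 28    |
| 3  | Tacos              | 27    |
| 32 | Latin American     | 26    |
| 8  | Southern / Soul    | 16    |
| 4  | Korean             | 13    |
| 17 | Vegetarian / Vegan | 13    |
| 15 | Falafel            | 9     |
| 27 | Grocery Store      | 9     |
| 23 | Peruvian           | 9     |
| 30 | Spanish            | 9     |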


The top 5 restaurant categories that are found between 1,500 and 6,000 meters from the proposed Amazon HQ2 location but not within 1,500 meters are French, Tacos, Latin American, Southern / Soul, and Korean.

Let's identify the potential locations.

In [70]:
df_near_piv = df_near.groupby(['Section']).count()

In [100]:
df_near_piv.reset_index(inplace = True)

In [101]:
df_near_piv.sort_values(['Name'], inplace = True)

In [102]:
# Explicit copy so the rename below does not trigger pandas' SettingWithCopyWarning
df_near_piv = df_near_piv[['Section', 'Category']].copy()

In [104]:
df_near_piv.rename(columns = {'Category':'Count'}, inplace = True)


In [106]:
df_near_piv

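|    | Section | Count |
|----|---------|-------|
| 0  | 123     | 1     |
| 13 | 200     | 1     |
| 6  | 162     | 1     |
| 3  | 143     | 1     |
| 4  | 144     | 1     |
| 22 | 242     | 3     |
| 20 | 240     | 4     |
| 2  | 142     | 7     |
| 19 | 222     | 7     |
| 1  | 124     | 7     |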

In [120]:
df_near_neigh = pd.merge(df_neigh_near, df_near_piv, how = 'left', on = 'Section')

In [121]:
df_near_neigh.fillna(0, inplace = True)

In [131]:
df_near_neigh.sort_values(['Count'])

|    | Section | Latitude  | Longitude  | Count |
|----|---------|-----------|------------|-------|
| 29 | 125     | 38.840961 | -77.04443  | 0.0   |
| 6  | 223     | 38.864419 | -77.041645 | 0.0   |
| 18 | 161     | 38.849955 | -77.065428 | 0.0   |
| 17 | 185     | 38.855181 | -77.034466 | 0.0   |
| 13 | 181     | 38.854696 | -77.062109 | 0.0   |
| 11 | 204     | 38.8598   | -77.038055 | 0.0   |
| 12 | 180     | 38.854573 | -77.069019 | 0.0   |
| 27 | 123     | 38.840718 | -77.058249 | 1.0   |
| 25 | 144     | 38.84558  | -77.048019 | 1.0   |
| 24 | 143     | 38.845458 | -77.054929 | 1.0   |

Based on the result above, neighborhood sections 125, 223, 161, 185, 181, 204, and 180 have no restaurants at all. Let's look at them on a map. The size of each circle depicts the number of restaurants in that neighborhood.
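
The left merge followed by `fillna(0)` works because sections that returned no venues are missing from the per-section counts, so they end up with a count of zero. The same idea can be sketched without pandas; the section ids below are invented for illustration and `sections_without_venues` is a hypothetical helper.

```python
def sections_without_venues(candidate_sections, venue_sections):
    """Return candidate sections that never appear among the venue records."""
    occupied = set(venue_sections)
    return [sec for sec in candidate_sections if sec not in occupied]

# Hypothetical section ids for illustration
candidate_ids = [101, 102, 103, 104]
venue_section_ids = [101, 101, 103]   # one entry per venue found
empty_sections = sections_without_venues(candidate_ids, venue_section_ids)
print(empty_sections)
```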

In [123]:
lat_near = df_near_neigh['Latitude'].tolist()
lon_near = df_near_neigh['Longitude'].tolist()
sec_near = df_near_neigh['Section'].tolist()
ct_near = df_near_neigh['Count'].tolist()

In [133]:
map_NL = folium.Map(location = [nl_latitude, nl_longitude], zoom_start = 12)

for lat, lon, sec, ct in zip(lat_near, lon_near, sec_near, ct_near):
    if sec in close_proximity:
        folium.Circle([lat, lon], radius = (ct+2)*5, color = 'blue', fill = True,
                     popup = sec, tooltip = ct).add_to(map_NL)
        

folium.Circle([nl_latitude, nl_longitude], radius = 1500, color = 'green', fill = False,
             ).add_to(map_NL)


map_NL

Of the 7 sections with no restaurants, we can rule out sections 125, 223, 185, and 204 because they are in the airport. That leaves sections 161, 180, and 181 as the potential sites.

In [144]:
solution0 = [161, 180, 181]
solution1 = [123, 143, 144, 162, 200]

In [147]:
map_NL = folium.Map(location = [nl_latitude, nl_longitude], zoom_start = 14)

for lat, lon, sec in zip(latitudes, longitudes, sections):
    if sec in solution0:
        # Tooltip shows the section number; `ct` from the earlier loop is stale here
        folium.Circle([lat, lon], radius = 300, color = 'blue', fill = True,
                     popup = sec, tooltip = sec).add_to(map_NL)
    elif sec in solution1:
        folium.Circle([lat, lon], radius = 300, color = 'purple', fill = True,
                     popup = sec, tooltip = sec).add_to(map_NL)
        

folium.Circle([nl_latitude, nl_longitude], radius = 1500, color = 'green', fill = False,
             ).add_to(map_NL)


map_NL

## Results and Discussion <a name="results"></a>

The results of this study show that the restaurant types that are most frequent between 1,500 and 6,000 meters from the proposed Amazon HQ2 site, and not available within 1,500 meters, are French, Tacos, Latin American, Southern / Soul, and Korean food. <br>
<br>
Of the 30 potential locations within 1,500 meters of the proposed Amazon HQ2 location, 7 have no restaurants. Four of those locations are in the airport, leaving three (161, 180, 181). <br>
<br>
What this study did not consider is the zoning of the neighborhoods. The three resulting locations may be zoned as residential and may not allow a restaurant to be opened. This study did not seek out the data necessary to answer that specific question. <br>
<br>
If this study had that information, it could have factored it in and possibly looked at other areas. <br>
Other potential areas are neighborhood sections 123, 143, 144, 162, and 200 (purple shaded circles). These sections each have only one restaurant.

## Conclusion <a name="conclusion"></a>

The purpose of this project was, first, to identify an optimal type of restaurant to open to cater to the potential influx of up to 75,000 people moving into the National Landing area to work at Amazon HQ2, either directly or indirectly. Second, the project set out to identify the optimal location for that restaurant. <br>
<br>
This study narrowed the type of restaurant down to 5 and the potential locations to 8 neighborhood sections within 1,500 meters of the proposed Amazon HQ2 location. Ultimately, the stakeholders must make a decision based on their specific needs.