# Applied Data Science Capstone

## Introduction

In this scenario, I am looking for an ideal neighborhood for a late-night french fry boutique. Even in Atlanta, there are few late-night food options, so there is potential demand. This hypothetical boutique would sell items priced from about $5 to $15 and, to define the business, would focus on snack and comfort food rather than entrees.

To guide enquiry, we will lead with a few hypotheses and verify them as best as we can with available data.
### Positive Influencers
* Proximity to younger areas, high schools, and colleges
* Affluent area
* Local bars and late-night attractions
* Lots of pedestrians
    * Area should be known as generally safe
    * Pedestrians new to the store may be more willing to stop in and grab a snack  
    
### Negative Influencers  

* Fast Food
    * May lose potential new customers if they compare our fries 'with the works' to a chain's $0.99 box of fries

## Data

We will be using several datasets to evaluate our hypotheses above and answer our ultimate question.
* NPU JSON to map the statistical regions, the neighborhood planning units (NPU.json)
* Data concerning various statistics about the neighborhoods (NPU_DATA.csv)
* Data gathered from Foursquare (venues.csv)
    * Discover information about potential competitors to confirm, deny, or modify our hypotheses
    * Determine companion shops via clustering to generate viable neighborhoods
* [Crime data](http://www.atlantapd.org/i-want-to/crime-data-downloads) to evaluate safety (COBRA-2019.csv)

In [1]:
import pandas as pd
import os

In [10]:
# Paths to the prepared datasets
npu = os.path.join('Data', 'NPU_DATA.csv')
crime = os.path.join('Data', 'COBRA-2019.csv')
venues = os.path.join('Data', 'venues.csv')

# 'Unnamed: 0' is a leftover index column stored in the CSV files; drop it by name
venues_df = pd.read_csv(venues)
venues_df = venues_df.drop(columns='Unnamed: 0')
npu_df = pd.read_csv(npu)
npu_df = npu_df.drop(columns='Unnamed: 0')
crime_df = pd.read_csv(crime)

In [11]:
npu_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102 entries, 0 to 101
Data columns (total 13 columns):
NPU                                                                 102 non-null object
Count, Total population, 2015                                       102 non-null int64
Median, Median age (years), 2015                                    101 non-null float64
Median, Median value of owner-occupied unit (dollars), 2015         101 non-null float64
Median, Median gross rent (dollars), 2015                           101 non-null float64
Median, Median household Income, 2015                               100 non-null float64
Percent, Public transportation (excluding taxicab) to work, 2015    101 non-null float64
Percent, Walked to work, 2015                                       101 non-null float64
Percent, Other means to work, 2015                                  101 non-null float64
Crime Count                                                         102 non-null int64
Neighborho

In [12]:
venues_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3658 entries, 0 to 3657
Data columns (total 7 columns):
Neighborhood              3658 non-null object
Neighborhood Latitude     3658 non-null float64
Neighborhood Longitude    3658 non-null float64
Venue                     3658 non-null object
Venue Latitude            3658 non-null float64
Venue Longitude           3658 non-null float64
Venue Category            3658 non-null object
dtypes: float64(4), object(3)
memory usage: 200.1+ KB


In [13]:
crime_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6349 entries, 0 to 6348
Data columns (total 19 columns):
Report Number              6349 non-null int64
Report Date                6349 non-null object
Occur Date                 6349 non-null object
Occur Time                 6349 non-null object
Possible Date              6349 non-null object
Possible Time              6349 non-null int64
Beat                       6343 non-null float64
Apartment Office Prefix    164 non-null object
Apartment Number           1119 non-null object
Location                   6349 non-null object
Shift Occurrence           6349 non-null object
Location Type              5916 non-null object
UCR Literal                6349 non-null object
UCR #                      6349 non-null int64
IBR Code                   6349 non-null object
Neighborhood               6115 non-null object
NPU                        6346 non-null object
Latitude                   6349 non-null float64
Longitude                  6349

I've overlaid the crime data onto the statistical regions.

In [14]:
from IPython.display import IFrame

# Embed the pre-rendered HTML crime map in the notebook (width, height in pixels)
crime_map = os.path.join('Figures', 'atl_crime_map.html')
IFrame(crime_map, 1000, 1000)

## Methodology

### Safety

We evaluated safety by counting the violent crimes in each region. Because we are interested in how dangerous a region is perceived to be, the total number of violent crimes is more relevant than the per capita rate. We then had to broadcast each region's count down to its individual neighborhoods, which subdivide the neighborhood planning units (NPUs).
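
The counting and broadcasting step can be sketched as follows, using toy rows in place of the real files. The `UCR Literal` and `NPU` column names come from the crime data above. The set of offenses treated as violent, the neighborhood and NPU names, and every value are assumptions made for illustration, not the choices used in the actual analysis.

```python
import pandas as pd

# Toy stand-in for crime_df; rows are invented for illustration.
crime_df = pd.DataFrame({
    'UCR Literal': ['AGG ASSAULT', 'LARCENY-FROM VEHICLE', 'ROBBERY-PEDESTRIAN',
                    'AGG ASSAULT', 'BURGLARY-RESIDENCE'],
    'NPU': ['X', 'X', 'X', 'Y', 'Y'],
})

# Toy stand-in for the neighborhood table: several neighborhoods share one NPU.
neighborhoods_df = pd.DataFrame({
    'Neighborhood': ['Neighborhood A', 'Neighborhood B',
                     'Neighborhood C', 'Neighborhood D'],
    'NPU': ['X', 'X', 'Y', 'Z'],
})

# Assumption: which offense labels count as violent is a hypothetical choice.
VIOLENT = {'AGG ASSAULT', 'ROBBERY-PEDESTRIAN', 'HOMICIDE'}

# Count violent crimes per NPU (total count, not per capita).
violent = crime_df[crime_df['UCR Literal'].isin(VIOLENT)]
counts = violent.groupby('NPU').size().rename('Crime Count').reset_index()

# Broadcast: a left merge copies each NPU's total onto every neighborhood in it.
neighborhoods_df = neighborhoods_df.merge(counts, on='NPU', how='left')

# NPUs with no matching crimes come back as NaN; treat them as zero.
neighborhoods_df['Crime Count'] = neighborhoods_df['Crime Count'].fillna(0).astype(int)
print(neighborhoods_df)
```

With this approach, every neighborhood in the same NPU carries the same crime count, which matches the intent of measuring perceived danger at the regional level.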

### Attractions

We determined the attractions for each neighborhood by first using the Google Maps API to find representative coordinates for the neighborhood. We then used that representative point to find all attractions within a fixed radius (1 km). From these, we identified compatible attractions and used them to classify whether an area was a good candidate for our restaurant.
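
To make the radius criterion concrete, here is a minimal sketch that filters venues by great-circle distance from a representative point. It replaces the API-side radius search with a local haversine calculation, so it is an illustration of the 1 km rule rather than the procedure used to build venues.csv. The coordinates, venue names, and the helper names `haversine_km` and `venues_within` are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def venues_within(center, venues, radius_km=1.0):
    """Return the venues that lie within radius_km of the center point."""
    lat0, lon0 = center
    return [v for v in venues
            if haversine_km(lat0, lon0, v['lat'], v['lon']) <= radius_km]

# Hypothetical representative point and venues, for illustration only.
center = (33.7490, -84.3880)
venues = [
    # Less than 1 km north of the center
    {'name': 'Venue A', 'category': 'Dive Bar', 'lat': 33.7530, 'lon': -84.3880},
    # More than 1 km east of the center
    {'name': 'Venue B', 'category': 'Gastropub', 'lat': 33.7490, 'lon': -84.3700},
]

nearby = venues_within(center, venues)
print([v['name'] for v in nearby])
```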

### Analysis

We scaled the data so that it could be used by our machine learning algorithms. We began with a procedure similar to the one in the Toronto neighborhoods project, but silhouette analysis showed that the resulting clusters were poorly separated. The solution was to recast the question as a classification problem rather than a clustering problem. The similar venue categories were determined to be: 'Gastropub', 'Burger Joint', 'Candy Store', 'Dive Bar', 'Fish & Chips Shop', 'Fried Chicken Joint'. Notable exceptions are various cultural and fast food categories. Training a K-Nearest Neighbors model on the combined neighborhoods and venues datasets let us filter out most of the neighborhoods. Finally, we produced the result by assigning importance to low median age, low crime, and high income.
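
The sketch below illustrates the three techniques named above (scaling, silhouette analysis of clusters, and K-Nearest Neighbors classification) on synthetic data with scikit-learn. It is not the notebook's actual pipeline: the feature matrix, the label definition, the choice of `KMeans` as the clustering algorithm, the cluster counts, and `n_neighbors=5` are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for per-neighborhood features on very different scales
# (think median age, crime count, household income). Values are invented.
X = rng.normal(size=(60, 3)) * [8.0, 150.0, 20000.0] + [35.0, 300.0, 50000.0]

# Hypothetical label: 1 if the neighborhood hosts a "similar" venue category.
y = (X[:, 0] < 35.0).astype(int)

# Scale so that no single feature dominates the distance computations.
X_scaled = StandardScaler().fit_transform(X)

# Clustering check: silhouette scores lie in [-1, 1]; values near 0
# indicate overlapping, poorly separated clusters.
scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)
print(scores)

# Classification alternative: fit KNN on labeled rows, then predict the rest.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_scaled[:45], y[:45])
candidates = knn.predict(X_scaled[45:])
print(candidates)
```

Comparing silhouette scores across several values of `k` is one way to see whether any clustering is well separated before deciding to switch to a supervised formulation.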

## Results

In [17]:
resultpath = os.path.join('Data', 'Result.csv')
result = pd.read_csv(resultpath)
result = result.drop(columns='Unnamed: 0')
result

| | Name | Crime Count | Population | Median Age | Median House Value | Median Rent | Median Household Income | Public Transit | Walk | Other transit |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Atlanta University Center | 177 | 7091 | 20.5 | 156008.0 | 829.0 | 27742.0 | 0.12 | 0.297 | 0.016 |
| 1 | Thomasville Heights | 136 | 2988 | 21.1 | 82332.0 | 736.0 | 10480.0 | 0.242 | 0.005 | 0.0 |
| 2 | Home Park | 587 | 5420 | 24.1 | 211485.0 | 1248.0 | 44207.0 | 0.048 | 0.225 | 0.041 |
| 3 | Atlantic Station | 587 | 4120 | 27.1 | 257667.0 | 1190.0 | 46956.0 | 0.028 | 0.056 | 0.033 |
| 4 | Browns Mill Park | 136 | 4661 | 29.7 | 83727.0 | 820.0 | 22507.0 | 0.25 | 0.001 | 0.02 |


These results yield areas near 