# Capstone Project-Battle of Neighborhoods
### Applied Data Science Capstone by IBM/Coursera

By Brandon Risley
(July 20, 2020)

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results & Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

The wise Benjamin Franklin once said, “in this world, nothing can be said to be certain, except death and taxes.” While true, Mr. Franklin did not have the foresight to see the only other certainty in the world: coffee! Americans are obsessed with coffee. According to the National Coffee Association, 64% of Americans consume coffee every day, and on average each American consumes 3.1 cups of coffee a day [1]. With over 400 million cups of coffee sold every day in the United States, it is no surprise that the coffee industry has reached a market value of $45.4 billion [1,2]. The demand for coffee has created a modern-day gold rush. Since 2019, there has been a 3.8% increase in the number of coffee shops, bringing the total number of coffee outlets to approximately 35,616 [1,2].

Young entrepreneurs eager to take a slice of the coffee profit must choose their opening location wisely. Established competition can be a major barrier to entry for new stores. According to Statista, 92% of coffee consumers feel some degree of brand loyalty [3]. Hence, opening a new coffee shop in an already saturated area could prove detrimental to the shop’s success. Furthermore, new coffee shop owners should consider the population of the area, because population indirectly translates to potential customers. While there are many other factors to consider, such as local taxation policies and the cost of rent, an often overlooked component of selecting a business location is the crime rate. High-crime areas can scare customers away from the business. Additionally, crime-dense areas increase the chances of property damage, which leads to increased operational costs.

As a young entrepreneur who loves money and coffee, I decided to investigate this idea further. I currently live in Atlanta, Georgia, and set out to find the best neighborhood in Atlanta to open a coffee shop.


## Data <a name="data"></a>

To find the optimal neighborhood for my new coffee business (perhaps Brandon’s Better Brew is a fitting name!), I found publicly available data from three services: The Atlanta Regional Commission, The Atlanta Police Department, and FourSquare.

**The Atlanta Regional Commission (ARC)**: 

The ARC provided the geojson data for the neighborhoods in Atlanta. Though Atlanta officially consists of 243 neighborhoods, the ARC groups similar neighborhoods (particularly those away from the city center) together. For example, Buckhead Forest and South Tuxedo Park are considered different neighborhoods, but the ARC combines them into one larger neighborhood. In addition to the neighborhood geojson data, the ARC dataset provides neighborhood populations sourced from the 2010 census. In summary, the ARC dataset provides:
1. Geojson data to create a map of Atlanta’s neighborhoods
2. Populations of each neighborhood

**The Atlanta Police Department (APD)**:

The APD provided Atlanta’s crime reports between 2009 and 2019. The seriousness of the crimes varied from petty theft to homicide, but to keep this study as simple as possible, the severity of each crime was not taken into consideration when describing a neighborhood’s crime rate; only the number of crimes in an area was considered. Unlike the ARC data, the APD data attributes each crime to one of the 243 individual neighborhoods, so some data manipulation is needed to match the neighborhood groups in the ARC dataset. For example, in the APD dataset, the crimes at South Tuxedo Park and Buckhead Forest are listed separately. To match the ARC dataset, the crime counts for these two neighborhoods are summed and assigned to the ARC group “Buckhead Forest, South Tuxedo Park”. In summary, the APD data provides:
1. The crimes that occurred in each neighborhood

**FourSquare**:

The FourSquare data extracted from the developer API provides the last piece of missing information: coffee shop locations. Using each neighborhood's latitude and longitude coordinates and a search radius based on the size of the neighborhood, the number of coffee shops in the area was fetched through FourSquare. In summary, the FourSquare data provides:
1. Number of coffee shops in a neighborhood 


## Methodology <a name="methodology"></a>

### Data Wrangling & Cleaning

This code section installs and imports the packages used for this analysis. The primary packages are **pandas**, **numpy**, **matplotlib**, and **folium**.

In [68]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # installs geopy; comment this line out if it is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (needs pandas 1.0 or later)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # installs folium; comment this line out if it is already installed
import folium # map rendering library

import matplotlib as mpl
import matplotlib.pyplot as plt

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Libraries imported.


In [69]:
#Load the Atlanta PD dataset
df_crime = pd.read_csv('COBRA-2009-2019.csv')
print(df_crime.shape)
df_crime.head(3)

(342914, 19)



Unnamed: 0,Report Number,Report Date,Occur Date,Occur Time,Possible Date,Possible Time,Beat,Apartment Office Prefix,Apartment Number,Location,Shift Occurence,Location Type,UCR Literal,UCR #,IBR Code,Neighborhood,NPU,Latitude,Longitude
0,90010930,2009-01-01,2009-01-01,1145,2009-01-01,1148.0,411.0,,,2841 GREENBRIAR PKWY,Day Watch,8,LARCENY-NON VEHICLE,630,2303,Greenbriar,R,33.68845,-84.49328
1,90011083,2009-01-01,2009-01-01,1330,2009-01-01,1330.0,511.0,,,12 BROAD ST SW,Day Watch,9,LARCENY-NON VEHICLE,630,2303,Downtown,M,33.7532,-84.39201
2,90011208,2009-01-01,2009-01-01,1500,2009-01-01,1520.0,407.0,,,3500 MARTIN L KING JR DR SW,Unknown,8,LARCENY-NON VEHICLE,630,2303,Adamsville,H,33.75735,-84.50282


The Atlanta PD dataset contains 342,914 reported crimes, each described by 19 columns. For now, all we care about are the neighborhoods and the crimes. Using the pandas groupby() and count() methods, we can count the number of crimes that occurred in each neighborhood.

In [70]:
df_crime_count = df_crime[['Neighborhood','UCR Literal']]
df_crime_count = df_crime_count.groupby('Neighborhood',axis =0).count()
df_crime_count.reset_index(inplace=True)

df_crime_count.columns=['Neighborhood','# of Crimes']
df_crime_count['# of Crimes'] = df_crime_count['# of Crimes'].astype(float)
df_crime_count.set_index('Neighborhood', inplace = True)
df_crime_count.head(5)

Neighborhood,# of Crimes
Adair Park,2012.0
Adams Park,1504.0
Adamsville,2798.0
Almond Park,850.0
Amal Heights,372.0


Nice! Now we have the number of crimes that occurred in each neighborhood from 2009 to 2019! Remember how, in the Data section, I mentioned that the APD and ARC data use different neighborhoods? In case you forgot, the APD dataset lists crimes for all 243 neighborhoods in Atlanta, while the ARC combines some neighborhoods into larger groups. For example, in the APD dataset, the crimes at South Tuxedo Park and Buckhead Forest are listed separately, but in the ARC's geojson data, the two are combined. To use the information provided by both datasets, we need to combine the crime data to match the grouped neighborhoods in the ARC dataset. To do this, we will use a for loop that checks, for each APD neighborhood, which ARC group name contains it, and adds its crime count to that group.

However, before we can do that, we need to take some precautions. Atlanta has a habit of giving neighborhoods similar names. For example, 'Tuxedo Park' and 'South Tuxedo Park' are different neighborhoods! This happens quite often, and the distinction needs to be made for the loop to work properly: because the loop matches by substring, a search for 'Tuxedo Park' would also match the group that contains 'South Tuxedo Park'. This problem is fixed by padding the shorter of the similar names with trailing spaces. Below, we alter these names in the crime dataset to make the distinction.

In [71]:
neighborhood_list = df_crime_count.index.tolist() #Convert to list

#Add spaces to similar names to make distinctions
neighborhood_list[neighborhood_list.index('Tuxedo Park')] = 'Tuxedo Park  '
neighborhood_list[neighborhood_list.index('Bankhead')] = 'Bankhead  '
neighborhood_list[neighborhood_list.index('Ben Hill')] = 'Ben Hill  '
neighborhood_list[neighborhood_list.index('Bolton')] = 'Bolton  '
neighborhood_list[neighborhood_list.index('Brookwood')] = 'Brookwood  '
neighborhood_list[neighborhood_list.index('Chastain Park')] = 'Chastain Park  '
neighborhood_list[neighborhood_list.index('Greenbriar')] = 'Greenbriar  '
neighborhood_list[neighborhood_list.index('Lakewood')] = 'Lakewood  '
neighborhood_list[neighborhood_list.index('Lenox')] = 'Lenox  '
neighborhood_list[neighborhood_list.index('Paces')] = 'Paces  '
neighborhood_list[neighborhood_list.index('Oakland')] = 'Oakland  '
neighborhood_list[neighborhood_list.index('Fairburn')] = 'Fairburn  '

#Replace with the fixed neighborhood data
df_crime_count['Neighborhoods'] = neighborhood_list
df_crime_count.set_index('Neighborhoods', inplace = True)

Once this is fixed on the crime data set provided by the APD, the city statistical data provided by the ARC must be edited to reflect this change.

In [72]:
#Load the city data
df_city = pd.read_csv('City Data.csv')

#Add a column to the city dataframe to take crime data
df_city['Crime Count'] = 0.0
df_city.set_index('OBJECTID', inplace = True)

#Make the distinctions reflect in the city data 
df_city.loc[5,'A']= 'Chastain Park  , Tuxedo Park  '
df_city.loc[53,'A'] = 'Bankhead  , Washington Park'
df_city.loc[56,'A'] ='Arlington Estates, Ben Hill  , Butner/Tell, Elmco Estates, Fairburn  , Fairburn Tell, Fairway Acres, Huntington, Lake Estates, Wildwood Forest'
df_city.loc[50,'A'] = 'Bolton  , Riverside, Whittier Mill Village'
df_city.loc[81,'A'] = 'Ardmore, Brookwood  '
df_city.loc[59,'A'] = 'Greenbriar  '
df_city.loc[47,'A'] = 'Lakewood  , Leila Valley, Norwood Manor, Rebel Valley Forest'
df_city.loc[90,'A'] = 'Buckhead Heights, Lenox  , Ridgedale Park'
df_city.loc[37,'A'] = 'Margaret Mitchell, Paces  , Pleasant Hill'
df_city.loc[100,'A'] = 'Grant Park, Oakland  '
df_city.rename(columns={'A':'Neighborhood'},inplace = True)

#Drop irrelevant information
df_city.drop('NPU',inplace=True, axis = 1)

In [73]:
df_city.head(5)

OBJECTID,POP2010,Neighborhood,Crime Count
1,2672,"Arden/Habersham, Argonne Forest, Peachtree Bat...",0.0
2,3736,"Peachtree Heights East, Peachtree Hills",0.0
3,4874,Peachtree Heights West,0.0
4,3372,"Buckhead Forest, South Tuxedo Park",0.0
5,3423,"Chastain Park , Tuxedo Park",0.0


The city dataframe contains the population and the name of each neighborhood group, and it assigns each group a unique Object ID. Additionally, we added an empty crime count column to fill. Now, with the changes made, let's fill the crime count column with our data from the APD!

In [74]:
#Run through the city data frame 'Neighborhood' column
#For each neighborhood in the crime dataframe, see if that string is contained in the 'Neighborhood' column
#If the neighborhood is included in that row, add the crime_rate data to it. 
for neighborhood in neighborhood_list:
    index = df_city[df_city['Neighborhood'].str.contains(neighborhood) == True].index.values
    df_city.loc[index,'Crime Count'] = df_city.loc[index,'Crime Count']+df_crime_count.loc[neighborhood,'# of Crimes']


In [75]:
df_city.head(5)

OBJECTID,POP2010,Neighborhood,Crime Count
1,2672,"Arden/Habersham, Argonne Forest, Peachtree Bat...",507.0
2,3736,"Peachtree Heights East, Peachtree Hills",877.0
3,4874,Peachtree Heights West,1145.0
4,3372,"Buckhead Forest, South Tuxedo Park",3099.0
5,3423,"Chastain Park , Tuxedo Park",802.0
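
The substring search in the loop above is what makes the padded names necessary. As an alternative, and only as a sketch rather than the method used in this notebook, each grouped ARC name could be split on commas and matched against the APD names exactly, so that 'Tuxedo Park' can never match 'South Tuxedo Park'. The helper name `sum_crimes_by_group` and the counts below are hypothetical toy values, and the sketch assumes that no individual neighborhood name contains a comma.

```python
def sum_crimes_by_group(group_names, crime_counts):
    """Sum per-neighborhood counts for each comma-separated group name.

    Matching is exact on each member name, so no padding trick is needed.
    """
    totals = {}
    for group in group_names:
        members = [name.strip() for name in group.split(',')]
        # Neighborhoods missing from crime_counts contribute zero
        totals[group] = sum(crime_counts.get(name, 0) for name in members)
    return totals


# Toy counts for illustration only (not APD figures)
crime_counts = {
    'Buckhead Forest': 10,
    'South Tuxedo Park': 20,
    'Chastain Park': 5,
    'Tuxedo Park': 7,
}
groups = ['Buckhead Forest, South Tuxedo Park', 'Chastain Park, Tuxedo Park']

totals = sum_crimes_by_group(groups, crime_counts)
print(totals)
```

With exact matching, the 'Tuxedo Park' count is added only to the group that lists it as a member.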


Now that we have all of the crime data, it is time for a slight change in direction. When we open the geojson data in a text editor, we can see that each neighborhood is outlined by numerous latitude and longitude coordinates. These coordinates create a polygon around each neighborhood that folium uses to overlay the neighborhood boundaries. Later on, we want to know how many coffee shops are in each neighborhood. FourSquare finds this information by sweeping an area within a certain distance of a given coordinate. Since the geojson file gives us a polygon rather than a single coordinate, we must find a center coordinate for each neighborhood. Furthermore, we must find an appropriate distance for each neighborhood that will allow FourSquare to find all the coffee shops in the area. To get started, we add seven new columns to the dataframe to hold the center coordinates of each neighborhood, its latitude and longitude boundaries, and the number of coffee shops found in it.

In [76]:
#Create the new columns, filled with zeros for now
for column in ['Lat', 'Long', 'Lat_max', 'Lat_min', 'Long_max', 'Long_min', 'Coffee Spots']:
    df_city[column] = 0.0

df_city.reset_index(inplace = True)

df_city.head()

Unnamed: 0,OBJECTID,POP2010,Neighborhood,Crime Count,Lat,Long,Lat_max,Lat_min,Long_max,Long_min,Coffee Spots
0,1,2672,"Arden/Habersham, Argonne Forest, Peachtree Bat...",507.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,3736,"Peachtree Heights East, Peachtree Hills",877.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,4874,Peachtree Heights West,1145.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,3372,"Buckhead Forest, South Tuxedo Park",3099.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,3423,"Chastain Park , Tuxedo Park",802.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


With the columns established, we then iterate through the geojson data of each neighborhood to find the center and boundaries of each.

In [77]:
#Open the geojson data
with open('d6298dee8938464294d3f49d473bcf15_196.geojson') as geojson_data:
    Atlanta_data = json.load(geojson_data)
Atlanta = Atlanta_data['features']

#Iterate through all neighborhoods
for i in range(0,102):
    
    #Three neighborhoods had oddities and are addressed with the if statement
    if i == 54 or i == 55 or i==61:
        One_Polygon = Atlanta[i]['geometry']['coordinates'][0][0]
    else:
        One_Polygon = Atlanta[i]['geometry']['coordinates'][0]
        
    All_Lats = np.linspace(0,0,len(One_Polygon))
    All_Long = np.linspace(0,0,len(One_Polygon))
    
    #Collect all the longitude and latitude data for a neighborhood
    for coor in range(0,len(One_Polygon)):
        All_Long[coor] = One_Polygon[coor][0]
        All_Lats[coor] = One_Polygon[coor][1]
    
    #Find the means (centers) and the boundaries (max and min)
    Lat_Centroid = np.mean(All_Lats)
    Long_Centroid = np.mean(All_Long)
   
    df_city.iloc[i,4] = Lat_Centroid
    df_city.iloc[i,5] = Long_Centroid
    
    df_city.iloc[i,6] = np.max(All_Lats)
    df_city.iloc[i,7] = np.min(All_Lats)
    df_city.iloc[i,8] = np.max(All_Long)
    df_city.iloc[i,9] = np.min(All_Long)
    

In [78]:
df_city.head()

Unnamed: 0,OBJECTID,POP2010,Neighborhood,Crime Count,Lat,Long,Lat_max,Lat_min,Long_max,Long_min,Coffee Spots
0,1,2672,"Arden/Habersham, Argonne Forest, Peachtree Bat...",507.0,33.828327,-84.398229,33.84565,33.815796,-84.388457,-84.408303,0.0
1,2,3736,"Peachtree Heights East, Peachtree Hills",877.0,33.817221,-84.383298,33.828671,33.812501,-84.372667,-84.390164,0.0
2,3,4874,Peachtree Heights West,1145.0,33.83307,-84.390663,33.84434,33.819926,-84.379929,-84.397227,0.0
3,4,3372,"Buckhead Forest, South Tuxedo Park",3099.0,33.846373,-84.384555,33.854188,33.839438,-84.371396,-84.395688,0.0
4,5,3423,"Chastain Park , Tuxedo Park",802.0,33.866798,-84.39692,33.886829,33.844027,-84.382653,-84.41366,0.0
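
The loop above hard-codes three feature indices whose coordinates are nested one level deeper. A possible alternative, sketched here under the assumption that those features are GeoJSON `MultiPolygon` geometries, is to branch on the geometry's `type` field instead. The function names `outer_ring` and `vertex_summary` and the toy square are hypothetical. Like the notebook, the sketch averages the ring's vertices, which approximates the center but is not an area-weighted centroid.

```python
def outer_ring(geometry):
    """Return the outer ring of a Polygon, or of the first polygon of a MultiPolygon."""
    coords = geometry['coordinates']
    if geometry['type'] == 'Polygon':
        return coords[0]
    if geometry['type'] == 'MultiPolygon':
        return coords[0][0]
    raise ValueError('Unsupported geometry type: ' + geometry['type'])


def vertex_summary(geometry):
    """Vertex-mean center and bounding box of a geometry's outer ring."""
    ring = outer_ring(geometry)
    lons = [point[0] for point in ring]  # GeoJSON stores longitude first
    lats = [point[1] for point in ring]
    return {
        'Lat': sum(lats) / len(lats),
        'Long': sum(lons) / len(lons),
        'Lat_max': max(lats), 'Lat_min': min(lats),
        'Long_max': max(lons), 'Long_min': min(lons),
    }


# Toy square for illustration; a GeoJSON ring repeats its first point at the end
square = [[-84.4, 33.7], [-84.3, 33.7], [-84.3, 33.8], [-84.4, 33.8], [-84.4, 33.7]]
polygon = {'type': 'Polygon', 'coordinates': [square]}
multipolygon = {'type': 'MultiPolygon', 'coordinates': [[square]]}

summary = vertex_summary(polygon)
print(summary)
```

Because the closing point of a ring duplicates the first one, the vertex mean is pulled slightly toward that corner; this matches the behavior of the loop above.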


Now we have the center coordinates and the latitude and longitude boundaries for each neighborhood. With the boundaries and centers, it is time to find the number of coffee shops in each area using FourSquare!
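
Before calling the API, it can help to keep the FourSquare credentials out of the notebook itself. The sketch below is one way to do that, not the approach taken here: it builds the same query parameters used in the request loop but reads the credentials from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `foursquare_params` are assumptions chosen for illustration.

```python
import os


def foursquare_params(lat, lng, radius_m, category_id, env=os.environ):
    """Build the query parameters for the venues/explore request.

    Credentials come from `env` (the process environment by default),
    so they never appear in the notebook source.
    """
    return {
        'client_id': env['FOURSQUARE_CLIENT_ID'],          # hypothetical variable name
        'client_secret': env['FOURSQUARE_CLIENT_SECRET'],  # hypothetical variable name
        'v': '20200605',
        'll': '{}, {}'.format(lat, lng),
        'categoryId': category_id,
        'limit': 200,
        'radius': radius_m,
    }


# Demonstration with a stand-in environment instead of real credentials
demo_env = {'FOURSQUARE_CLIENT_ID': 'demo-id', 'FOURSQUARE_CLIENT_SECRET': 'demo-secret'}
params = foursquare_params(33.75, -84.39, 500, 'demo-category', env=demo_env)
print(sorted(params))
```

A missing variable raises a `KeyError` right away, which is easier to diagnose than a rejected request.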

In [79]:
#Define function used to extract FourSquare information
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [80]:
for t in range(0,102):
    
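    #Approximate the neighborhood's extent in meters, taking 1 degree of latitude or longitude as 111.699 km
    #Note: this overestimates east-west distances, since a degree of longitude is shorter at Atlanta's latitude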
    Lat_Radius = (abs(df_city.loc[t, 'Lat_max'] - df_city.loc[t, 'Lat_min'])*111.699)*1000
    Long_Radius = (abs(df_city.loc[t, 'Long_max'] - df_city.loc[t, 'Long_min'])*111.699)*1000
    Avg_Radius = (Long_Radius + Lat_Radius)/2 #Define the radius as the average 

    neighborhood_latitude = df_city.loc[t, 'Lat'] # neighborhood latitude value
    neighborhood_longitude = df_city.loc[t, 'Long'] # neighborhood longitude value

    url = 'https://api.foursquare.com/v2/venues/explore'

    #Call FourSquare
    params = dict(
        client_id='YOUR_CLIENT_ID', # FourSquare credentials redacted; supply your own
        client_secret='YOUR_CLIENT_SECRET',
        v='20200605',
        ll='{}, {}'.format(neighborhood_latitude, neighborhood_longitude),
        categoryId='4bf58dd8d48988d1e0931735', # category used to restrict results to coffee shops
        limit=200,
        radius=Avg_Radius
    )

    resp = requests.get(url=url, params=params)
    data = json.loads(resp.text)

    venues = data['response']['groups'][0]['items']

    nearby_venues = json_normalize(venues) # flatten JSON

    if nearby_venues.empty:
        df_city.iloc[t,10] = 0
    else:
        # filter columns
        filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
        nearby_venues =nearby_venues.loc[:, filtered_columns]

        # filter the category for each row
        nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

        # clean columns
        nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

        df_city.iloc[t,10] = nearby_venues.shape[0]



In [81]:
df_city.head()

Unnamed: 0,OBJECTID,POP2010,Neighborhood,Crime Count,Lat,Long,Lat_max,Lat_min,Long_max,Long_min,Coffee Spots
0,1,2672,"Arden/Habersham, Argonne Forest, Peachtree Bat...",507.0,33.828327,-84.398229,33.84565,33.815796,-84.388457,-84.408303,24.0
1,2,3736,"Peachtree Heights East, Peachtree Hills",877.0,33.817221,-84.383298,33.828671,33.812501,-84.372667,-84.390164,13.0
2,3,4874,Peachtree Heights West,1145.0,33.83307,-84.390663,33.84434,33.819926,-84.379929,-84.397227,27.0
3,4,3372,"Buckhead Forest, South Tuxedo Park",3099.0,33.846373,-84.384555,33.854188,33.839438,-84.371396,-84.395688,48.0
4,5,3423,"Chastain Park , Tuxedo Park",802.0,33.866798,-84.39692,33.886829,33.844027,-84.382653,-84.41366,61.0
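
The search radius in the loop above uses the full north-south and east-west extent of each neighborhood and treats a degree of longitude as being as long as a degree of latitude. Both choices make the search circle larger than the neighborhood, so nearby shops can be counted for more than one neighborhood. A possible refinement, offered only as a sketch and not used for the results in this report, takes half of each extent and scales longitude by the cosine of the latitude. The function name `bounding_box_radius_m` is hypothetical, and the Earth is approximated as a sphere with a radius of 6,371 km.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius (spherical approximation)
METERS_PER_DEGREE = math.pi / 180.0 * EARTH_RADIUS_M  # length of one degree of latitude


def bounding_box_radius_m(lat_min, lat_max, lon_min, lon_max):
    """Average of the half-height and half-width of a bounding box, in meters."""
    mid_lat = math.radians((lat_min + lat_max) / 2.0)
    half_height = (lat_max - lat_min) / 2.0 * METERS_PER_DEGREE
    # A degree of longitude shrinks with the cosine of the latitude
    half_width = (lon_max - lon_min) / 2.0 * METERS_PER_DEGREE * math.cos(mid_lat)
    return (half_height + half_width) / 2.0


# Bounding box of the first neighborhood group in the table above
radius = bounding_box_radius_m(33.815796, 33.84565, -84.408303, -84.388457)
print(round(radius), 'meters')
```

A smaller radius reduces double counting between neighboring areas, at the cost of possibly missing shops near the corners of irregularly shaped neighborhoods.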


Awesome! Using the FourSquare API and the coordinates extracted from the geojson file, we were able to find the number of coffee shops in each neighborhood! Now that we have all of the relevant information, let's save it all to a dataframe that only holds what we want to investigate. 

In [82]:
df_analysis = df_city[['Neighborhood','POP2010','Crime Count','Coffee Spots']].copy() #copy to avoid SettingWithCopyWarning in later edits

In [83]:
df_analysis.set_index('Neighborhood',inplace=True)

In [84]:
#Drop the AUC and Airport since they do not have crime data. This skews the results, so they are dropped. 
df_analysis.drop('Airport', inplace = True)
df_analysis.drop('AUC', inplace= True)


In [85]:
df_analysis.head()

Unnamed: 0_level_0,POP2010,Crime Count,Coffee Spots
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"Arden/Habersham, Argonne Forest, Peachtree Battle Alliance, Wyngate",2672,507.0,24.0
"Peachtree Heights East, Peachtree Hills",3736,877.0,13.0
Peachtree Heights West,4874,1145.0,27.0
"Buckhead Forest, South Tuxedo Park",3372,3099.0,48.0
"Chastain Park , Tuxedo Park",3423,802.0,61.0


Here it is! Isn't it gorgeous? This dataframe shows the population of each neighborhood, its crime count, and the number of coffee shops in the area. With this, it is time to perform some analysis...

## Analysis <a name="analysis"></a>

With all of the relevant information, it is now time to find out where I should open Brandon's Better Brew! To do this, I will analyze the data with a weighted decision matrix. Each neighborhood is ranked on each of three criteria: population, crime, and coffee spots. Population is ranked from most populous (rank 1) to least populous (the largest rank). Crime is ranked in the opposite order, with the neighborhood with the least crime receiving a rank of 1 and the one with the most crime receiving the largest rank. Coffee shops are ranked in the same manner as crime, with the neighborhood with the fewest coffee shops receiving a rank of 1. The code uses dense ranking, so neighborhoods with equal values share a rank. Once ranked, a weight is applied to each criterion. Though there is no exact method for choosing the weights, I believe that population is the most important criterion, with the number of coffee shops second and crime the least important. With this in mind, I assign population 50% of the weight, coffee shops 30%, and crime 20%. The weighted ranks are then added across each neighborhood to produce a single score that accounts for all three criteria. The neighborhood with the lowest score will be crowned the most suitable Atlanta neighborhood for my coffee shop.
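
To make the scoring concrete, here is a minimal standard-library sketch of the ranking and weighting steps described above. The notebook itself uses pandas; the function names `dense_rank` and `weighted_score`, the neighborhoods A, B, and C, and their values are hypothetical toy inputs.

```python
def dense_rank(values, ascending=True):
    """Rank values 1, 2, 3, ... with ties sharing a rank and no gaps."""
    ordered = sorted(set(values), reverse=not ascending)
    rank_of = {value: rank for rank, value in enumerate(ordered, start=1)}
    return [rank_of[value] for value in values]


# Weights from the text: population 50%, crime 20%, coffee shops 30%
WEIGHTS = {'pop': 0.5, 'crime': 0.2, 'coffee': 0.3}


def weighted_score(pop_rank, crime_rank, coffee_rank, weights=WEIGHTS):
    """Weighted sum of the three ranks; a lower score is better."""
    return (weights['pop'] * pop_rank
            + weights['crime'] * crime_rank
            + weights['coffee'] * coffee_rank)


# Toy inputs for three hypothetical neighborhoods
names = ['A', 'B', 'C']
population = [5000, 3000, 5000]
crimes = [900, 100, 400]
coffee_shops = [12, 0, 20]

pop_ranks = dense_rank(population, ascending=False)  # most populous gets rank 1
crime_ranks = dense_rank(crimes)                     # fewest crimes gets rank 1
coffee_ranks = dense_rank(coffee_shops)              # fewest coffee shops gets rank 1

scores = [weighted_score(p, c, k)
          for p, c, k in zip(pop_ranks, crime_ranks, coffee_ranks)]
best = names[scores.index(min(scores))]
print(dict(zip(names, scores)), 'best:', best)
```

In this toy example, A and C tie on population and share rank 1, which is what dense ranking means in practice.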

In [86]:
df_analysis['Pop Rank'] = df_analysis['POP2010'].rank(method='dense', ascending=False)
df_analysis['Crime Rank'] = df_analysis['Crime Count'].rank(method='dense', ascending=True)
df_analysis['Coffee Rank'] = df_analysis['Coffee Spots'].rank(method='dense', ascending=True)
df_ranks = df_analysis[['Pop Rank','Crime Rank','Coffee Rank']].copy() #copy so the weighted columns can be added without warnings
df_ranks.head()


Neighborhood,Pop Rank,Crime Rank,Coffee Rank
"Arden/Habersham, Argonne Forest, Peachtree Battle Alliance, Wyngate",79.0,9.0,21.0
"Peachtree Heights East, Peachtree Hills",45.0,18.0,13.0
Peachtree Heights West,22.0,21.0,23.0
"Buckhead Forest, South Tuxedo Park",54.0,64.0,32.0
"Chastain Park , Tuxedo Park",53.0,16.0,36.0


The table above shows a snapshot of some of the neighborhood rankings. Below, we will apply the weights and tally the weighted rankings to create a score for each neighborhood.  

In [87]:
df_ranks['W Pop'] = 0.5*df_ranks['Pop Rank']
df_ranks['W Crime'] = 0.2*df_ranks['Crime Rank']
df_ranks['W Coffee'] = 0.3*df_ranks['Coffee Rank']
df_ranks['Score'] = df_ranks['W Pop']+df_ranks['W Crime']+df_ranks['W Coffee']


In [88]:
df_ranks.reset_index(inplace=True)

In [89]:
df_ranks.head()

Unnamed: 0,Neighborhood,Pop Rank,Crime Rank,Coffee Rank,W Pop,W Crime,W Coffee,Score
0,"Arden/Habersham, Argonne Forest, Peachtree Bat...",79.0,9.0,21.0,39.5,1.8,6.3,47.6
1,"Peachtree Heights East, Peachtree Hills",45.0,18.0,13.0,22.5,3.6,3.9,30.0
2,Peachtree Heights West,22.0,21.0,23.0,11.0,4.2,6.9,22.1
3,"Buckhead Forest, South Tuxedo Park",54.0,64.0,32.0,27.0,12.8,9.6,49.4
4,"Chastain Park , Tuxedo Park",53.0,16.0,36.0,26.5,3.2,10.8,40.5


Above, you can see the score for each neighborhood. Now, we will find the area with the lowest score, and thus, the best conditions for opening a shop. 

In [90]:
print('The best area to open a coffee shop is:', df_ranks.loc[df_ranks[df_ranks['Score'] == df_ranks['Score'].min()].index.values,'Neighborhood'])
print('The weighted rank is: ', df_ranks['Score'].min())

The best area to open a coffee shop is: 71    Ivan Hill
Name: Neighborhood, dtype: object
The weighted rank is:  10.899999999999999


Amazing! So from our analysis, we found that Ivan Hill is the neighborhood best suited for my coffee shop, with a score of 10.9! Let's visualize the scores of the neighborhoods with a choropleth map and find where to build this money maker!

In [91]:
!wget --quiet https://opendata.arcgis.com/datasets/d6298dee8938464294d3f49d473bcf15_196.geojson
print('GeoJSON file downloaded!')

Atlanta_geo = r'd6298dee8938464294d3f49d473bcf15_196.geojson' # geojson file

GeoJSON file downloaded!


In [92]:
Best_lat = 33.743242
Best_long = -84.488256
address = 'Atlanta, GA'
geolocator = Nominatim(user_agent="Atlanta_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude


Atlanta_score_map = folium.Map(location = [latitude, longitude], zoom_start =12)
Atlanta_score_map.choropleth(
    geo_data= Atlanta_geo,
    data= df_ranks,
    columns=['Neighborhood','Score'],
    fill_color='YlGn', 
    key_on = 'feature.properties.NEIGHBORHO',
    fill_opacity=0.7, 
    line_opacity=0.2,
    legend_name='Overall Score')

folium.features.CircleMarker(
    [Best_lat, Best_long],
    radius=10,
    color='red',
    popup='The Estimated Coordinates',
    fill = True,
    fill_color = 'red',
    fill_opacity = 0.6
).add_to(Atlanta_score_map)

Atlanta_score_map
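
A choropleth joins the score table to the map by name, here through `key_on = 'feature.properties.NEIGHBORHO'`, and a name that differs even by trailing spaces has no score to join to. Since some names were padded with spaces earlier in this notebook, it may be worth checking the join before trusting the map. The sketch below is one hedged way to do that; the helper `unmatched_names` and the toy data are hypothetical.

```python
def unmatched_names(geojson, table_names, property_name='NEIGHBORHO'):
    """Return the names that appear on only one side of the choropleth join."""
    geo_names = {feature['properties'][property_name]
                 for feature in geojson['features']}
    table_names = set(table_names)
    return {
        'only_in_geojson': sorted(geo_names - table_names),
        'only_in_table': sorted(table_names - geo_names),
    }


# Toy stand-in for the real geojson file and score table
toy_geojson = {'features': [
    {'properties': {'NEIGHBORHO': 'Ivan Hill'}},
    {'properties': {'NEIGHBORHO': 'Greenbriar'}},
]}
toy_table_names = ['Ivan Hill', 'Greenbriar  ']  # second name carries padding spaces

report = unmatched_names(toy_geojson, toy_table_names)
print(report)
```

With the notebook's own objects, the equivalent call would be `unmatched_names(Atlanta_data, df_ranks['Neighborhood'])`; two empty lists would mean every neighborhood on the map has a score.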


In [93]:
df_ranks.set_index('Neighborhood',inplace=True)

In [94]:
#Sorted list of neighborhoods by score
df_ranks.sort_values(by = ['Score'])

Neighborhood,Pop Rank,Crime Rank,Coffee Rank,W Pop,W Crime,W Coffee,Score
Ivan Hill,19.0,1.0,4.0,9.5,0.2,1.2,10.9
Pine Hills,7.0,29.0,34.0,3.5,5.8,10.2,19.5
"Georgia Tech, Marietta Street Artery",8.0,32.0,35.0,4.0,6.4,10.5,20.9
"Baker Hills, Bakers Ferry, Boulder Park, Fairburn Road/Wisteria Lane, Ridgecrest Forest, Wildwood (NPU-H), Wilson Mill Meadows, Wisteria Gardens",27.0,27.0,7.0,13.5,5.4,2.1,21.0
"Campbellton Road, Fort Valley, Pomona Park",10.0,78.0,3.0,5.0,15.6,0.9,21.5
Peachtree Heights West,22.0,21.0,23.0,11.0,4.2,6.9,22.1
Collier Heights,15.0,69.0,3.0,7.5,13.8,0.9,22.2
"Bolton , Riverside, Whittier Mill Village",17.0,52.0,14.0,8.5,10.4,4.2,23.1
"Adams Park, Laurens Valley, Southwest",12.0,82.0,5.0,6.0,16.4,1.5,23.9
"Atkins Park, Virginia Highland",4.0,74.0,25.0,2.0,14.8,7.5,24.3


In [95]:
#Information for Ivan Hill
df_ranks.loc['Ivan Hill']

Pop Rank       19.0
Crime Rank      1.0
Coffee Rank     4.0
W Pop           9.5
W Crime         0.2
W Coffee        1.2
Score          10.9
Name: Ivan Hill, dtype: float64

In [96]:
#Mean score of all neighborhoods
np.mean(df_ranks['Score'])

39.035

## Results & Discussion <a name="results"></a>

Out of the 102 neighborhoods in Atlanta classified by the ARC, the best neighborhood to open my new coffee business would be ***Ivan Hill***. Before weighting, Ivan Hill ranked 19th, 1st, and 4th for population, crime, and coffee shops, respectively. The weighted score of Ivan Hill, at 10.9, was 8.6 points better than that of the runner-up, Pine Hills. The average score for all neighborhoods was 39.035, making the score of Ivan Hill especially impressive. Looking at the map, we can see that the neighborhoods surrounding Ivan Hill also have favorable scores. The map does not show a clear pattern suggesting that it is better to open a coffee shop farther from the center of the city. The weighted analysis does a good job of taking the different criteria into consideration. While the inner city has a large population, it also has higher crime rates and more coffee shops, making it less favorable than a neighborhood such as Ivan Hill, which may not have as large a population but has much less crime and far fewer shops.

## Conclusion <a name="conclusion"></a>

The purpose of this project was to identify the best neighborhood in Atlanta, GA to open a new coffee shop. We looked at the crime count, population, and coffee shop saturation in each area to determine the neighborhood most deserving of Brandon's Better Brew. We used the geojson and population data provided by the Atlanta Regional Commission, crime data provided by the Atlanta Police Department, and coffee shop information provided through FourSquare's developer API. Each neighborhood was assigned a rank for each of the criteria, and then a weighted decision matrix analysis was performed to calculate a score for each neighborhood. This score was used to determine which neighborhood would be the most suitable for opening a successful new coffee business. The scores were mapped on a choropleth map of the Atlanta neighborhoods, and the neighborhood with the best (lowest) score was found and displayed on the map.

Given the factors considered, I believe that the Ivan Hill neighborhood is the best area to open a new coffee shop. With this information, I will talk to investors to find funding for the new project!

## References

[1] https://disturbmenot.co/coffee-statistics/

[2] https://dailycoffeenews.com/2018/10/31/allegra-predicts-specialty-segment-growth-in-2019-us-coffee-shop-report/

[3] https://www.statista.com/statistics/680152/coffee-brand-loyalty-us-consumers/