# A New Restaurant in Atlanta

## Introduction to the problem

Atlanta is the 9th most-populous metropolitan area in the United States and is home to a diverse collection of people, activities, locations, and, of course, food. Heralded as "The Empire City of the South", the Atlanta area has over 6 million residents and is the headquarters of several Fortune 500 companies. Everyone has to eat, and there are thousands of restaurants within Atlanta. In this project we will explore possible locations for a new Mexican restaurant within Atlanta. We will consider features such as demographics, crime data, and neighborhoods.

## Data

##### Crime Data
Crime data comes from the Atlanta Police Department and can be downloaded here: https://www.atlantapd.org/Home/ShowDocument?id=3051

##### Demographic Data
Demographic data comes from the uszipcode package.
It lets us restrict the search to Atlanta and attach zip-code-level figures (population, income, home value) to each neighborhood.

##### Restaurant Data
Restaurant data comes from the Foursquare API, queried from this Python notebook.

## Methodology

In [1]:
import pandas as pd
import requests, zipfile
import io
from bs4 import BeautifulSoup

In [2]:
zip_file_url = 'https://www.atlantapd.org/Home/ShowDocument?id=3051'

r = requests.get(zip_file_url, stream = True)
z = zipfile.ZipFile(io.BytesIO(r.content))

In [3]:
#See what files are inside
z.infolist()

[<ZipInfo filename='COBRA-2009-2019.csv' compress_type=deflate external_attr=0x20 file_size=60871571 compress_size=13142288>,
 <ZipInfo filename='READ ME.txt' compress_type=deflate external_attr=0x20 file_size=179 compress_size=122>]
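
The download step can be tried without network access. The following is a minimal sketch that builds a small archive in memory (the file names mirror the listing above, but the two CSV rows are made-up stand-ins, not APD data) and then picks the CSV member by extension rather than by position:

```python
import csv
import io
import zipfile

# Build a tiny stand-in archive in memory (contents are illustrative only).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("COBRA-2009-2019.csv", "Report Number,Neighborhood\n1,Downtown\n2,Lenox\n")
    archive.writestr("READ ME.txt", "stand-in notes")

# Reopen it the same way the notebook wraps r.content.
z = zipfile.ZipFile(io.BytesIO(buffer.getvalue()))

# Select the CSV by extension instead of relying on infolist()[0].
csv_name = next(name for name in z.namelist() if name.lower().endswith(".csv"))

with z.open(csv_name) as raw:
    rows = list(csv.DictReader(io.TextIOWrapper(raw, encoding="utf-8")))

print(csv_name, len(rows))
```

With pandas, `pd.read_csv(z.open(csv_name))` would read the same member.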

In [4]:
crime_df = pd.read_csv( z.open(z.infolist()[0].filename), parse_dates = [1, 2, 4], infer_datetime_format = True )



In [5]:
crime_df.dtypes

Report Number                       int64
Report Date                datetime64[ns]
Occur Date                 datetime64[ns]
Occur Time                         object
Possible Date              datetime64[ns]
Possible Time                     float64
Beat                              float64
Apartment Office Prefix            object
Apartment Number                   object
Location                           object
Shift Occurence                    object
Location Type                      object
UCR Literal                        object
UCR #                               int64
IBR Code                           object
Neighborhood                       object
NPU                                object
Latitude                          float64
Longitude                         float64
dtype: object

In [6]:
crime_df.head()

Unnamed: 0,Report Number,Report Date,Occur Date,Occur Time,Possible Date,Possible Time,Beat,Apartment Office Prefix,Apartment Number,Location,Shift Occurence,Location Type,UCR Literal,UCR #,IBR Code,Neighborhood,NPU,Latitude,Longitude
0,90010930,2009-01-01,2009-01-01,1145,2009-01-01,1148.0,411.0,,,2841 GREENBRIAR PKWY,Day Watch,8,LARCENY-NON VEHICLE,630,2303,Greenbriar,R,33.68845,-84.49328
1,90011083,2009-01-01,2009-01-01,1330,2009-01-01,1330.0,511.0,,,12 BROAD ST SW,Day Watch,9,LARCENY-NON VEHICLE,630,2303,Downtown,M,33.7532,-84.39201
2,90011208,2009-01-01,2009-01-01,1500,2009-01-01,1520.0,407.0,,,3500 MARTIN L KING JR DR SW,Unknown,8,LARCENY-NON VEHICLE,630,2303,Adamsville,H,33.75735,-84.50282
3,90011218,2009-01-01,2009-01-01,1450,2009-01-01,1510.0,210.0,,,3393 PEACHTREE RD NE,Evening Watch,8,LARCENY-NON VEHICLE,630,2303,Lenox,B,33.84676,-84.36212
4,90011289,2009-01-01,2009-01-01,1600,2009-01-01,1700.0,411.0,,,2841 GREENBRIAR PKWY SW,Unknown,8,LARCENY-NON VEHICLE,630,2303,Greenbriar,R,33.68677,-84.49773


In [7]:
#Get the number of crimes per neighborhood
n_crimes = crime_df[['Report Number', 'Neighborhood']].groupby(['Neighborhood']).agg('count')
n_crimes.rename(columns={'Report Number': 'crimes'}, inplace = True)
n_crimes.head()

Unnamed: 0_level_0,crimes
Neighborhood,Unnamed: 1_level_1
Adair Park,2012
Adams Park,1504
Adamsville,2798
Almond Park,850
Amal Heights,372


In [8]:
#Get average lat and long by neighborhood
n_coords = crime_df[['Latitude', 'Longitude', 'Neighborhood']].groupby(['Neighborhood']).agg('mean')
n_coords.head()

Unnamed: 0_level_0,Latitude,Longitude
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1
Adair Park,33.729698,-84.410426
Adams Park,33.713987,-84.460214
Adamsville,33.758748,-84.503608
Almond Park,33.784186,-84.46047
Amal Heights,33.708719,-84.398984


In [9]:
#Alright, so now we have the average location as well as the number of crimes per neighborhood, let's merge them
df_merged = pd.merge(n_crimes, n_coords, left_index = True, right_index = True)
df_merged.head()

Unnamed: 0_level_0,crimes,Latitude,Longitude
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Adair Park,2012,33.729698,-84.410426
Adams Park,1504,33.713987,-84.460214
Adamsville,2798,33.758748,-84.503608
Almond Park,850,33.784186,-84.46047
Amal Heights,372,33.708719,-84.398984
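
The count, the two means, and the merge can also be expressed as one grouped aggregation. A minimal sketch using pandas named aggregation on a few made-up rows (the column names follow the APD file; the values are illustrative only):

```python
import pandas as pd

# Illustrative stand-in for crime_df (values are made up).
sample = pd.DataFrame({
    "Report Number": [1, 2, 3],
    "Neighborhood": ["Downtown", "Downtown", "Lenox"],
    "Latitude": [33.75, 33.77, 33.85],
    "Longitude": [-84.39, -84.41, -84.36],
})

# One pass: count reports and average coordinates per neighborhood.
summary = sample.groupby("Neighborhood").agg(
    crimes=("Report Number", "count"),
    Latitude=("Latitude", "mean"),
    Longitude=("Longitude", "mean"),
)
print(summary)
```

The result has the same shape as the merged frame above: one row per neighborhood, indexed by name.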


In [10]:
#Let's get the zip codes of these neighborhoods
import geopy

def get_zipcode(df, geolocator, lat_field, lon_field):
    location = geolocator.reverse((df[lat_field], df[lon_field]))
    return location.raw['address']['postcode']

geolocator = geopy.Nominatim(user_agent='capstone')
zipcodes = df_merged.apply(get_zipcode, axis=1, geolocator=geolocator, lat_field='Latitude', lon_field='Longitude')
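
Reverse-geocoding results may not always include a postal code, in which case `location.raw['address']['postcode']` raises a `KeyError`. A minimal sketch of a more defensive lookup that operates on the raw dictionary only (the helper name `extract_zip5` and the sample dictionaries are hypothetical, and no request is made):

```python
def extract_zip5(raw):
    """Return the five-digit ZIP from a Nominatim-style raw result, or None."""
    if not raw:
        return None
    postcode = raw.get("address", {}).get("postcode")
    if not postcode:
        return None
    # Keep the first five characters so ZIP+4 values such as "30310-1234" still match.
    zip5 = postcode.strip()[:5]
    return zip5 if zip5.isdigit() and len(zip5) == 5 else None

print(extract_zip5({"address": {"postcode": "30310"}}))
print(extract_zip5({"address": {"road": "Broad St SW"}}))
```

Inside `get_zipcode`, this could be called as `extract_zip5(location.raw if location else None)`.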

In [11]:
zip_df = pd.DataFrame(zipcodes)
zip_df.rename(columns = {0: 'zip5'}, inplace = True)
zip_df.head()

Unnamed: 0_level_0,zip5
Neighborhood,Unnamed: 1_level_1
Adair Park,30310
Adams Park,30311
Adamsville,30311
Almond Park,30318
Amal Heights,30315


In [12]:
#Now we merge, again
df_merged = pd.merge(df_merged, zip_df, left_index = True, right_index = True)
df_merged.head()

Unnamed: 0_level_0,crimes,Latitude,Longitude,zip5
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Adair Park,2012,33.729698,-84.410426,30310
Adams Park,1504,33.713987,-84.460214,30311
Adamsville,2798,33.758748,-84.503608,30311
Almond Park,850,33.784186,-84.46047,30318
Amal Heights,372,33.708719,-84.398984,30315


In [13]:
#Going to use uszipcode to get some more info
!pip install uszipcode


In [14]:
from uszipcode import SearchEngine
search = SearchEngine(simple_zipcode=True)



In [15]:
#Initializing the data frame that will have all of the zipcode info
zipcode = search.by_zipcode(30309)
dict_zipcode = zipcode.to_dict()

#We need to delete keys as they cause issues when converting to a DF, plus we won't use them
del dict_zipcode['area_code_list']
del dict_zipcode['common_city_list']

#The actual df
zip_df = pd.DataFrame(dict_zipcode, index=[0])

In [16]:
#Now we will append every zip in our merged data frame to the above data frame

#First we minimize the number of lookups by keeping only the distinct values of zip5
zip_loops = df_merged['zip5'].drop_duplicates()

#Now we call all of the zipcodes in the above DF
for code in zip_loops:
    zipcode = search.by_zipcode(code)
    
    #Create the dictionary and remove the bad keys
    dict_zipcode = zipcode.to_dict()
    del dict_zipcode['area_code_list']
    del dict_zipcode['common_city_list']
    
    #Create a temporary df and append
    temp_zip_df = pd.DataFrame(dict_zipcode, index=[0])
    zip_df = pd.concat([zip_df, temp_zip_df])
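
Calling `pd.concat` inside the loop copies the growing frame on every iteration. A sketch of an alternative that collects plain dictionaries first and builds the frame once; the `fake_lookup` function is a hypothetical stand-in for `search.by_zipcode(code).to_dict()` so that the example runs without the uszipcode database, and its values are placeholders:

```python
import pandas as pd

DROP_KEYS = ("area_code_list", "common_city_list")  # list-valued keys the notebook deletes

def fake_lookup(code):
    # Hypothetical stand-in for search.by_zipcode(code).to_dict(); values are placeholders.
    return {"zipcode": code, "major_city": "Atlanta",
            "area_code_list": ["404"], "common_city_list": ["Atlanta"]}

codes = ["30310", "30311", "30310"]

records = []
for code in dict.fromkeys(codes):  # de-duplicate while keeping order
    info = fake_lookup(code)
    records.append({k: v for k, v in info.items() if k not in DROP_KEYS})

# Build the frame once, after the loop.
zip_table = pd.DataFrame(records)
print(zip_table)
```

This also gives the rows a regular 0..n-1 index instead of the repeated 0 seen in the output below.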

In [17]:
zip_df.drop_duplicates(inplace = True)
print(zip_df.columns)
zip_df.head()

Index(['zipcode', 'zipcode_type', 'major_city', 'post_office_city', 'county',
       'state', 'lat', 'lng', 'timezone', 'radius_in_miles', 'population',
       'population_density', 'land_area_in_sqmi', 'water_area_in_sqmi',
       'housing_units', 'occupied_housing_units', 'median_home_value',
       'median_household_income', 'bounds_west', 'bounds_east', 'bounds_north',
       'bounds_south'],
      dtype='object')


Unnamed: 0,zipcode,zipcode_type,major_city,post_office_city,county,state,lat,lng,timezone,radius_in_miles,...,land_area_in_sqmi,water_area_in_sqmi,housing_units,occupied_housing_units,median_home_value,median_household_income,bounds_west,bounds_east,bounds_north,bounds_south
0,30309,Standard,Atlanta,"Atlanta, GA",Fulton County,GA,33.8,-84.39,Eastern,2.0,...,3.42,0.04,16207,13730,288800,71854,-84.407849,-84.36857,33.818801,33.777831
0,30310,Standard,Atlanta,"Atlanta, GA",Fulton County,GA,33.73,-84.43,Eastern,3.0,...,8.82,0.01,14349,10697,89300,22861,-84.466965,-84.394397,33.754598,33.696383
0,30311,Standard,Atlanta,"Atlanta, GA",Fulton County,GA,33.73,-84.47,Eastern,3.0,...,12.43,0.04,15636,13125,121200,27651,-84.502793,-84.434022,33.764465,33.68457
0,30318,Standard,Atlanta,"Atlanta, GA",Fulton County,GA,33.79,-84.44,Eastern,4.0,...,20.36,0.18,25475,19812,174800,39421,-84.498731,-84.390567,33.832056,33.754464
0,30315,Standard,Atlanta,"Atlanta, GA",Fulton County,GA,33.7,-84.38,Eastern,3.0,...,11.31,0.02,14791,11771,111000,20951,-84.418328,-84.346205,33.741619,33.672925


In [18]:
#We only want a few of these columns
zip_df = zip_df[['zipcode', 'major_city', 'population', 'population_density', 'median_home_value', 'median_household_income']]
zip_df.rename(columns={'zipcode': 'zip5'}, inplace = True)
zip_df.head()

Unnamed: 0,zip5,major_city,population,population_density,median_home_value,median_household_income
0,30309,Atlanta,21845,6391.0,288800,71854
0,30310,Atlanta,26912,3051.0,89300,22861
0,30311,Atlanta,32218,2592.0,121200,27651
0,30318,Atlanta,49736,2442.0,174800,39421
0,30315,Atlanta,33857,2992.0,111000,20951


In [19]:
#Now we merge

#First we copy over the index as a column so it doesn't get lost
df_merged.reset_index(inplace=True)

#Then we merge
df_merged = pd.merge(df_merged, zip_df, left_on = 'zip5', right_on = 'zip5')
df_merged.head()

Unnamed: 0,Neighborhood,crimes,Latitude,Longitude,zip5,major_city,population,population_density,median_home_value,median_household_income
0,Adair Park,2012,33.729698,-84.410426,30310,Atlanta,26912,3051.0,89300,22861
1,Bush Mountain,340,33.727468,-84.430976,30310,Atlanta,26912,3051.0,89300,22861
2,Capitol View,1949,33.717322,-84.413997,30310,Atlanta,26912,3051.0,89300,22861
3,Capitol View Manor,400,33.717435,-84.404319,30310,Atlanta,26912,3051.0,89300,22861
4,Florida Heights,1719,33.750605,-84.464381,30310,Atlanta,26912,3051.0,89300,22861


In [20]:
#Let's visualize using folium
!pip install folium #You may need to install folium as I had to



In [21]:
#First import it
import folium
from IPython.display import Image 
#from IPython.core.display import HTML 
import matplotlib.cm as cm
import matplotlib.colors as colors

In [23]:
#Prepare our map
map_atlanta = folium.Map(location=[33.7490,-84.3880],zoom_start=13)

for lat, lng, neighbourhood in zip(df_merged['Latitude'],df_merged['Longitude'],df_merged['Neighborhood']):
    label = '{}'.format(neighbourhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=10,
        popup=label,
        color='gold',
        fill=True,
        fill_color='blue',
        fill_opacity=0.5).add_to(map_atlanta)
   

#Here is another icon I liked a bit but ended up not using:
#    folium.Marker(
#        location=[lat, lng],
#        icon=folium.Icon(color="red",icon="fa-fire", prefix='fa'),
#        opacity=0.7).add_to(map_atlanta)

In [24]:
#Let's see our map
map_atlanta

### We will now use Foursquare to find info on restaurants

In [33]:
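#Let's input our credentials. They are read from environment variables so that the keys
#are not stored in the notebook (the variable names below are only a suggestion):
import os
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']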
VERSION = '20180605'

In [68]:
#Get the venues we want
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius
            )
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Category']
    
    return(nearby_venues)
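
The list comprehension in `getNearbyVenues` assumes that every venue has at least one category and that the response always contains a `groups` entry. A sketch of a more defensive parser for the same response shape, run here against a hand-written payload rather than the live API (the helper name `parse_explore_items` and the payload are hypothetical):

```python
def parse_explore_items(payload, name, lat, lng):
    """Flatten an explore-style payload into (neighborhood, lat, lng, venue, category) rows."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        categories = venue.get("categories") or []
        # Fall back to None when a venue has no category instead of raising IndexError.
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue.get("name"), category))
    return rows

# Hand-written payload mimicking the structure the notebook indexes into.
payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Taqueria A", "categories": [{"name": "Mexican Restaurant"}]}},
    {"venue": {"name": "Unnamed Spot", "categories": []}},
]}]}}

for row in parse_explore_items(payload, "Downtown", 33.7532, -84.39201):
    print(row)
```

An empty or unexpected response then yields an empty list for that neighborhood rather than stopping the whole loop.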

In [69]:
venues_in_atlanta = getNearbyVenues(df_merged['Neighborhood'], df_merged['Latitude'], df_merged['Longitude'])


In [89]:
print(venues_in_atlanta.shape)

#Use the American spelling, consistent with the rest of the notebook
venues_in_atlanta.rename(columns={'Neighbourhood':'Neighborhood', 
                        'Neighbourhood Latitude':'Neighborhood Latitude', 
                        'Neighbourhood Longitude':'Neighborhood Longitude'}, inplace = True)

#Group by the renamed column (the old 'Neighbourhood' key no longer exists after the rename)
venues_in_atlanta.groupby('Neighborhood').head()

(1855, 5)

In [90]:
restaurants_in_atlanta = venues_in_atlanta[venues_in_atlanta['Venue Category'].str.contains("Restaurant")]
atl_mex = restaurants_in_atlanta[restaurants_in_atlanta['Venue Category'].str.contains("Mex")]
atl_mex.head(10)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Category
185,Berkeley Park,33.801969,-84.412627,La Parrilla,Mexican Restaurant
244,Bolton,33.816475,-84.450677,Carniceria Ramirez,Mexican Restaurant
272,Channing Valley,33.808088,-84.411759,Chipotle Mexican Grill,Mexican Restaurant
275,Channing Valley,33.808088,-84.411759,Moe's Southwest Grill,Mexican Restaurant
373,Marietta Street Artery,33.776211,-84.407223,bartaco,Mexican Restaurant
417,Springlake,33.814054,-84.411267,Chipotle Mexican Grill,Mexican Restaurant
467,Wildwood (NPU-C),33.809361,-84.414524,Chipotle Mexican Grill,Mexican Restaurant
474,Wildwood (NPU-C),33.809361,-84.414524,Moe's Southwest Grill,Mexican Restaurant
484,Benteen Park,33.714734,-84.364517,Carniceria Y Tiendas El Progresso,Mexican Restaurant
595,Brookwood,33.802098,-84.397085,Chipotle Mexican Grill,Mexican Restaurant


In [94]:
#Now we visualize in Folium once more
#Prepare our map
mex_atlanta = folium.Map(location=[33.7490,-84.3880],zoom_start=12)

for lat, lng, neighbourhood in zip(atl_mex['Neighborhood Latitude'], atl_mex['Neighborhood Longitude'], atl_mex['Neighborhood']):
    label = '{}'.format(neighbourhood)
    label = folium.Popup(label, parse_html=True)
    folium.Marker(
        location=[lat, lng], 
        popup=label,
        icon=folium.Icon(color="red",icon="fa-fire", prefix='fa'),
        opacity=0.9).add_to(mex_atlanta)

mex_atlanta

In [96]:
#Let's overlay the Mexican restaurant markers on the Atlanta neighborhood map:
for lat, lng, neighbourhood in zip(atl_mex['Neighborhood Latitude'], atl_mex['Neighborhood Longitude'], atl_mex['Neighborhood']):
    label = '{}'.format(neighbourhood)
    label = folium.Popup(label, parse_html=True)
    folium.Marker(
        location=[lat, lng], 
        popup=label,
        icon=folium.Icon(color="red",icon="fa-fire", prefix='fa'),
        opacity=0.9).add_to(map_atlanta)

map_atlanta

It looks like there are some neighborhoods that don't have a Mexican restaurant. Let's see what characteristics the neighborhoods that do have one share.

In [105]:
mex_df = pd.merge(df_merged, atl_mex, how = 'left', left_on = 'Neighborhood', right_on = 'Neighborhood', indicator = True)
mex_df.head()

Unnamed: 0,Neighborhood,crimes,Latitude,Longitude,zip5,major_city,population,population_density,median_home_value,median_household_income,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Category,_merge
0,Adair Park,2012,33.729698,-84.410426,30310,Atlanta,26912,3051.0,89300,22861,,,,,left_only
1,Bush Mountain,340,33.727468,-84.430976,30310,Atlanta,26912,3051.0,89300,22861,,,,,left_only
2,Capitol View,1949,33.717322,-84.413997,30310,Atlanta,26912,3051.0,89300,22861,,,,,left_only
3,Capitol View Manor,400,33.717435,-84.404319,30310,Atlanta,26912,3051.0,89300,22861,,,,,left_only
4,Florida Heights,1719,33.750605,-84.464381,30310,Atlanta,26912,3051.0,89300,22861,,,,,left_only


In [111]:
#Now let's create a one-hot encoding for the neighborhoods that have a Mexican restaurant vs those that don't
mex_df.drop_duplicates(subset=['Neighborhood'], inplace = True)
mex_df_hot = pd.get_dummies(mex_df[['_merge']], prefix="", prefix_sep="")
mex_df_hot['Neighborhood'] = mex_df['Neighborhood']
mex_df_hot.head()

Unnamed: 0,left_only,right_only,both,Neighborhood
0,1,0,0,Adair Park
1,1,0,0,Bush Mountain
2,1,0,0,Capitol View
3,1,0,0,Capitol View Manor
4,1,0,0,Florida Heights
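
Because only the `both` column is used later, the same flag can be derived without a merge and one-hot encoding. A minimal sketch on made-up frames (the column name `has_mexican` is an assumption introduced here, not a name from the notebook):

```python
import pandas as pd

# Illustrative stand-ins for df_merged and atl_mex.
neighborhoods = pd.DataFrame({"Neighborhood": ["Adair Park", "Berkeley Park", "Bolton"]})
mexican = pd.DataFrame({"Neighborhood": ["Berkeley Park", "Bolton", "Bolton"]})

# 1 if at least one Mexican restaurant was found near the neighborhood, else 0.
neighborhoods["has_mexican"] = (
    neighborhoods["Neighborhood"].isin(mexican["Neighborhood"]).astype(int)
)
print(neighborhoods)
```

A neighborhood with several Mexican restaurants still counts once, which matches the effect of `drop_duplicates` in the cell above.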


In [114]:
#Now we merge again to have the final data set (before the machine learning)
final_df = pd.merge(df_merged, mex_df_hot, left_on = 'Neighborhood', right_on = 'Neighborhood')
final_df.head()

Unnamed: 0,Neighborhood,crimes,Latitude,Longitude,zip5,major_city,population,population_density,median_home_value,median_household_income,left_only,right_only,both
0,Adair Park,2012,33.729698,-84.410426,30310,Atlanta,26912,3051.0,89300,22861,1,0,0
1,Bush Mountain,340,33.727468,-84.430976,30310,Atlanta,26912,3051.0,89300,22861,1,0,0
2,Capitol View,1949,33.717322,-84.413997,30310,Atlanta,26912,3051.0,89300,22861,1,0,0
3,Capitol View Manor,400,33.717435,-84.404319,30310,Atlanta,26912,3051.0,89300,22861,1,0,0
4,Florida Heights,1719,33.750605,-84.464381,30310,Atlanta,26912,3051.0,89300,22861,1,0,0


In [130]:
#Let's create some clusters and see if there's any cluster that has a lot of neighborhoods with Mexican restaurants
#First we import K-Means from scikit-learn
from sklearn.cluster import KMeans
import numpy as np

In [125]:
#Let's see what happens when k is only 5
k=5
atlanta_k_df = final_df[['Latitude', 'Longitude']]
kmeans = KMeans(n_clusters = k,random_state=0).fit(atlanta_k_df)
kmeans.labels_

array([3, 3, 3, 3, 0, 3, 3, 3, 3, 3, 0, 0, 2, 0, 2, 2, 0, 2, 0, 2, 0, 2,
       2, 2, 0, 2, 0, 2, 2, 0, 3, 2, 3, 2, 0, 0, 4, 0, 0, 0, 0, 0, 0, 4,
       0, 0, 4, 4, 1, 4, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0, 0, 0, 4, 4, 0, 4,
       0, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 1, 4, 4, 4, 4, 4, 1, 1, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 2, 2, 0, 0, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2,
       0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2,
       0, 2, 2, 2, 2, 2, 2, 2, 2, 0, 3, 3, 0, 0, 3, 0, 0, 3, 3, 0, 1, 0,
       1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 1, 3, 1, 1, 1, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 1, 1, 1, 1, 3, 3, 1, 1, 1, 1,
       1, 3, 1, 3, 4, 4, 4, 1, 1, 1, 1, 4, 1, 1, 1, 4, 2, 3], dtype=int32)
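
To make the clustering step concrete, the following is a minimal pure-Python sketch of Lloyd's algorithm, the iteration that k-means is based on. It is not scikit-learn's implementation (which, among other things, chooses its starting centers differently), and the starting centers here are picked by hand. Like the notebook, it treats latitude and longitude as plain planar coordinates:

```python
import math

def kmeans(points, centers, iterations=10):
    """Plain Lloyd's algorithm: assign points to the nearest center, then recompute centers."""
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest center for every point.
        labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for c, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center where it was
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return labels, centers

# Four neighborhood centroids taken from the tables above (latitude, longitude).
points = [(33.729698, -84.410426), (33.717322, -84.413997),
          (33.784186, -84.460470), (33.758748, -84.503608)]
labels, centers = kmeans(points, centers=[points[0], points[2]])
print(labels)
```

The two southern points (Adair Park, Capitol View) end up in one group and the two points further north-west (Almond Park, Adamsville) in the other.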

In [128]:
# I have been having some issues with the Cluster Labels column, so I am adding this check for robustness
if 'Cluster Labels' in final_df.columns:
    final_df.drop(columns=['Cluster Labels'], inplace = True)
final_df_k5 = final_df.copy()  # copy so the insert below does not also add the column to final_df
final_df_k5.insert(0, 'Cluster Labels', kmeans.labels_)
final_df_k5.head()

Unnamed: 0,Cluster Labels,Neighborhood,crimes,Latitude,Longitude,zip5,major_city,population,population_density,median_home_value,median_household_income,left_only,right_only,both
0,3,Adair Park,2012,33.729698,-84.410426,30310,Atlanta,26912,3051.0,89300,22861,1,0,0
1,3,Bush Mountain,340,33.727468,-84.430976,30310,Atlanta,26912,3051.0,89300,22861,1,0,0
2,3,Capitol View,1949,33.717322,-84.413997,30310,Atlanta,26912,3051.0,89300,22861,1,0,0
3,3,Capitol View Manor,400,33.717435,-84.404319,30310,Atlanta,26912,3051.0,89300,22861,1,0,0
4,0,Florida Heights,1719,33.750605,-84.464381,30310,Atlanta,26912,3051.0,89300,22861,1,0,0


In [132]:
# create map
map_clusters = folium.Map(location=[33.7490,-84.3880],zoom_start=12)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, neighbourhood, cluster in zip(final_df_k5['Latitude'], 
                                            final_df_k5['Longitude'], 
                                            final_df_k5['Neighborhood'], 
                                            final_df_k5['Cluster Labels']):
    label = folium.Popup(' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=10,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color='white',
        fill_opacity=0.9).add_to(map_clusters)
       
map_clusters

In [133]:
#Let's overlay the Mexican restaurants on this map again
for lat, lng, neighbourhood in zip(atl_mex['Neighborhood Latitude'], atl_mex['Neighborhood Longitude'], atl_mex['Neighborhood']):
    label = '{}'.format(neighbourhood)
    label = folium.Popup(label, parse_html=True)
    folium.Marker(
        location=[lat, lng], 
        popup=label,
        icon=folium.Icon(color="red",icon="fa-fire", prefix='fa'),
        opacity=0.9).add_to(map_clusters)

map_clusters

## Results 

In [135]:
#Which cluster has the most neighborhoods with a Mexican restaurant?
final_df_k5[['Cluster Labels', 'both']].groupby(['Cluster Labels']).sum()

Unnamed: 0_level_0,both
Cluster Labels,Unnamed: 1_level_1
0,1
1,7
2,0
3,2
4,13


It seems like cluster 4 has the most neighborhoods with a Mexican restaurant (13). Is there any space for one more?

In [136]:
final_df_k5[['Cluster Labels', 'both']].groupby(['Cluster Labels']).count()

Unnamed: 0_level_0,both
Cluster Labels,Unnamed: 1_level_1
0,47
1,30
2,56
3,50
4,55


Plenty of space! Only 13 of the 55 neighborhoods in cluster 4 have a Mexican restaurant, so the new one should probably go here. Let's look at crime and other statistics.
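
The two tables above can be combined into a share of neighborhoods per cluster that have at least one Mexican restaurant nearby. A short calculation using the figures printed above:

```python
# Figures copied from the two outputs above:
# cluster -> (neighborhoods with a Mexican restaurant, total neighborhoods).
clusters = {0: (1, 47), 1: (7, 30), 2: (0, 56), 3: (2, 50), 4: (13, 55)}

for label, (with_mex, total) in clusters.items():
    share = with_mex / total
    print(f"cluster {label}: {with_mex}/{total} = {share:.1%}, {total - with_mex} without")
```

By share, clusters 1 and 4 come out nearly level (roughly 23% each); cluster 4 leads on the absolute count.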

In [137]:
final_df_k5.head()

Unnamed: 0,Cluster Labels,Neighborhood,crimes,Latitude,Longitude,zip5,major_city,population,population_density,median_home_value,median_household_income,left_only,right_only,both
0,3,Adair Park,2012,33.729698,-84.410426,30310,Atlanta,26912,3051.0,89300,22861,1,0,0
1,3,Bush Mountain,340,33.727468,-84.430976,30310,Atlanta,26912,3051.0,89300,22861,1,0,0
2,3,Capitol View,1949,33.717322,-84.413997,30310,Atlanta,26912,3051.0,89300,22861,1,0,0
3,3,Capitol View Manor,400,33.717435,-84.404319,30310,Atlanta,26912,3051.0,89300,22861,1,0,0
4,0,Florida Heights,1719,33.750605,-84.464381,30310,Atlanta,26912,3051.0,89300,22861,1,0,0


In [166]:
#Let's get rid of the NaN in columns we want stats for
true_final_df_k5 = final_df_k5.dropna(axis = 0, 
                                      subset = ['Cluster Labels', 
                                                'crimes', 
                                                'population', 
                                                'median_home_value', 
                                                'median_household_income'])

true_final_df_k5[['Cluster Labels','crimes']].groupby(['Cluster Labels']).sum()

Unnamed: 0_level_0,crimes
Cluster Labels,Unnamed: 1_level_1
0,46661
1,109919
2,28382
3,90115
4,49602


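Cluster 4 also has far less reported crime than cluster 1 (49,602 vs. 109,919 incidents), the only other cluster with a comparable number of Mexican restaurants, although clusters 0 and 2 have lower totals still.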
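
Because the clusters differ in size, dividing each total by the number of neighborhoods in the cluster gives a fairer comparison. A short calculation using the totals above and the neighborhood counts from the earlier output:

```python
# cluster -> (total reported crimes, number of neighborhoods), copied from the outputs above.
totals = {0: (46661, 47), 1: (109919, 30), 2: (28382, 56), 3: (90115, 50), 4: (49602, 55)}

per_neighborhood = {label: crimes / n for label, (crimes, n) in totals.items()}

# Print clusters from the lowest to the highest crimes per neighborhood.
for label, value in sorted(per_neighborhood.items(), key=lambda kv: kv[1]):
    print(f"cluster {label}: {value:.1f} crimes per neighborhood")
```

On this measure cluster 4 (about 902) ranks second lowest, after cluster 2 (about 507).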

In [169]:
#Let's get some more stats

#We need to convert some columns to numeric
true_final_df_k5['median_household_income'] = pd.to_numeric(true_final_df_k5['median_household_income'])
true_final_df_k5['median_home_value'] = pd.to_numeric(true_final_df_k5['median_home_value'])
true_final_df_k5['population'] = pd.to_numeric(true_final_df_k5['population'])

#Now get the stats
true_final_df_k5.groupby(['Cluster Labels']).mean(numeric_only=True)

Unnamed: 0_level_0,crimes,Latitude,Longitude,population,population_density,median_home_value,median_household_income,left_only,right_only,both
Cluster Labels,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
0,992.787234,33.773054,-84.471066,42691.191489,2595.446809,153321.276596,36101.957447,0.978723,0.0,0.021277
1,3663.966667,33.76817,-84.359895,22334.366667,4511.5,286223.333333,58076.766667,0.766667,0.0,0.233333
2,506.821429,33.706347,-84.510679,49256.857143,1723.767857,145967.857143,38528.482143,1.0,0.0,0.0
3,1802.3,33.714504,-84.393134,27004.9,3329.56,113288.0,25900.46,0.96,0.0,0.04
4,901.854545,33.832431,-84.402909,27976.872727,3231.909091,450732.727273,87637.836364,0.763636,0.0,0.236364


It looks like cluster 4 also has the highest median_household_income (about $87,600 on average), although its median_home_value (about $450,700) suggests that rent may be expensive.

Now, let's decide which neighborhood within cluster 4 we should settle in:

In [180]:
#Let's get neighborhoods in Cluster 4 without a Mexican Restaurant
final_df_k5[ (final_df_k5['Cluster Labels'] == 4) & (final_df_k5['both'] != 1) ].sort_values(by = ['median_household_income'], 
                                                                                             ascending = False)

Unnamed: 0,Cluster Labels,Neighborhood,crimes,Latitude,Longitude,zip5,major_city,population,population_density,median_home_value,median_household_income,left_only,right_only,both
192,4,Castlewood,90,33.833115,-84.412728,30327,Atlanta,22208,1323.0,749000,139543,1,0,0
205,4,Woodfield,43,33.822567,-84.409994,30327,Atlanta,22208,1323.0,749000,139543,1,0,0
204,4,Whitewater Creek,60,33.872115,-84.440229,30327,Atlanta,22208,1323.0,749000,139543,1,0,0
203,4,Westminster/Milmar,66,33.836709,-84.4248,30327,Atlanta,22208,1323.0,749000,139543,1,0,0
201,4,Wesley Battle,85,33.826839,-84.430624,30327,Atlanta,22208,1323.0,749000,139543,1,0,0
199,4,Pleasant Hill,52,33.856998,-84.437782,30327,Atlanta,22208,1323.0,749000,139543,1,0,0
198,4,Paces,494,33.852113,-84.442315,30327,Atlanta,22208,1323.0,749000,139543,1,0,0
197,4,Mt. Paran/Northside,312,33.86709,-84.423675,30327,Atlanta,22208,1323.0,749000,139543,1,0,0
196,4,Mt. Paran Parkway,12,33.875465,-84.419089,30327,Atlanta,22208,1323.0,749000,139543,1,0,0
195,4,Margaret Mitchell,138,33.832218,-84.436586,30327,Atlanta,22208,1323.0,749000,139543,1,0,0


## Discussion 

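Based on everything that we have seen, Castlewood is the best place to open a new Mexican restaurant: it sits in cluster 4, has no Mexican restaurant nearby yet, and tops the list above when sorted by median household income.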

## Conclusion

Anywhere in Atlanta would be a great place to set up a new Mexican restaurant, but **Castlewood** would be the best place. It has low crime, a good-sized population, a high median household income, and is in the best cluster for Mexican restaurants. A caveat is that home prices are high, so rent is likely to be expensive, but the other features are likely to make up for it. In future work, trying a larger number of clusters may be useful.