# Capstone Project - The Battle of the Neighborhoods 
### Applied Data Science Capstone by IBM/Coursera

## Introduction
In this project, we will try to find an optimal location for a Chinese restaurant in Indianapolis, the capital of Indiana. The report is written for stakeholders who are interested in this investment. We evaluate candidate locations on two criteria:

1. Distance to the nearest Chinese restaurant
2. Restaurant clusters, i.e. the availability of other, non-Chinese restaurants nearby

Applying data science methods to these criteria, we expect to identify one or two of the most promising neighborhoods. The advantages of each area will then be described so that stakeholders can choose the best final location.

## Data
  
We will collect the following data:
1. the number of Chinese restaurants in each neighborhood and their distances to each other;
2. because Indianapolis has only a limited number of Chinese restaurants, the location of every restaurant cluster, so that we can look for clusters *without* any Chinese restaurant.

The type and location of the restaurants in every neighborhood will be obtained using the Foursquare API.

### Neighborhood Candidates
First, find the latitude and longitude of the center of Indianapolis, using a specific, well-known address (1 Monument Circle) and the Google Maps Geocoding API.

In [11]:
import requests

def get_coordinates(api_key, address, verbose=False):
    try:
        url = 'https://maps.googleapis.com/maps/api/geocode/json?key={}&address={}'.format(api_key, address)
        response = requests.get(url).json()
        if verbose:
            print('Google Maps API JSON result =>', response)
        results = response['results']
        geographical_data = results[0]['geometry']['location'] # get geographical coordinates
        lat = geographical_data['lat']
        lon = geographical_data['lng']
        return [lat, lon]
    except:
        return [None, None]
    
address = '1 Monument Cir, Indianapolis, IN'
google_api_key = 'AIzaSyDAFNgeFXwgYvjylQRpoKkrMO6PnIg0I9o'
indy_center = get_coordinates(google_api_key, address)
print('Coordinate of {}: {}'.format(address, indy_center))

Coordinate of 1 Monument Cir, Indianapolis, IN: [39.767884, -86.15729139999999]


After obtaining the geographic location of the Indianapolis city center, we need to transform it into a 2D Cartesian coordinate system, which allows us to calculate distances in meters (not in latitude/longitude degrees). Thus, functions that convert between the WGS84 spherical coordinate system (latitude/longitude degrees) and the UTM Cartesian coordinate system (X/Y coordinates in meters) were created.
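A UTM projection is only accurate near the central meridian of the chosen zone, so it is worth checking which zone a longitude falls in. The following is a minimal sketch of the standard 6-degree zone rule; `utm_zone_for_longitude` is a hypothetical helper that is not part of the notebook, and it ignores the special-case zones around Norway and Svalbard.

```python
def utm_zone_for_longitude(lon):
    """Return the standard UTM zone number (1-60) for a longitude in degrees.

    Hypothetical helper for illustration: zones are 6 degrees wide and
    zone 1 starts at 180 degrees West.
    """
    zone = int((lon + 180) // 6) + 1
    return min(zone, 60)  # lon == 180 would otherwise give 61


# Longitude of the Indianapolis city center used in this notebook
indy_zone = utm_zone_for_longitude(-86.15729139999999)
print('UTM zone for Indianapolis:', indy_zone)
```

The resulting zone number is the value one would normally pass as `zone` when constructing the UTM projection.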

In [12]:
import sys
print(sys.path)

!pip install shapely
import shapely.geometry

!pip install pyproj
import pyproj

import math

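def lonlat_to_xy(lon, lat):
    # NOTE: UTM zone 33 covers 12-18 degrees East (central Europe); Indianapolis
    # (about 86.16 degrees West) lies in UTM zone 16. The outputs below were
    # produced with zone=33, so X/Y fall far outside the zone and distances are distorted.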
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=33, datum='WGS84')
    xy = pyproj.transform(proj_latlon, proj_xy, lon, lat)
    return xy[0], xy[1]

def xy_to_lonlat(x, y):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=33, datum='WGS84')
    lonlat = pyproj.transform(proj_xy, proj_latlon, x, y)
    return lonlat[0], lonlat[1]

def calc_xy_distance(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)

print('Coordinate transformation check')
print('-------------------------------')
print('Indy center longitude={}, latitude={}'.format(indy_center[1], indy_center[0]))
x, y = lonlat_to_xy(indy_center[1], indy_center[0])
print('Indy center UTM X={}, Y={}'.format(x, y))
lo, la = xy_to_lonlat(x, y)
print('Indy center longitude={}, latitude={}'.format(lo, la))

['/home/jupyterlab/conda/envs/python/lib/python36.zip', '/home/jupyterlab/conda/envs/python/lib/python3.6', '/home/jupyterlab/conda/envs/python/lib/python3.6/lib-dynload', '', '/home/jupyterlab/conda/envs/python/lib/python3.6/site-packages', '/home/jupyterlab/conda/envs/python/lib/python3.6/site-packages/IPython/extensions', '/home/jupyterlab/.ipython']
Coordinate transformation check
-------------------------------
Indy center longitude=-86.15729139999999, latitude=39.767884
Indy center UTM X=-5766855.238421603, Y=11452926.307400493
Indy center longitude=-86.15729140000045, latitude=39.767884000000976


Here, we create the centroids of a grid of cells covering our area of interest, which is approximately 12 x 12 kilometers (12000 m x 12000 m) centered on the Indianapolis city center. To lay the centroids out on a hexagonal grid, centroids within a row are one unit apart in the x direction, successive rows are sqrt(3)/2 units apart in the y direction, and every other row is shifted by half a unit in x. We selected 600 meters as one unit at this stage, which gives a grid of about 21 x 24 candidate centroids, of which only those within roughly 6 km of the center are kept.


In [13]:
indy_center_x, indy_center_y = lonlat_to_xy(indy_center[1], indy_center[0]) # City center in Cartesian coordinates

k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells
x_min = indy_center_x - 6000

x_step = 600
y_min = indy_center_y - 6000 - (int(21/k)*k*600 - 12000)/2
y_step = 600 * k 

latitudes = []
longitudes = []
distances_from_center = []
xs = []
ys = []
for i in range(0, int(21/k)):
    y = y_min + i * y_step
    x_offset = 300 if i%2==0 else 0
    for j in range(0, 21):
        x = x_min + j * x_step + x_offset
        distance_from_center = calc_xy_distance(indy_center_x, indy_center_y, x, y)
        if (distance_from_center <= 6001):
            lon, lat = xy_to_lonlat(x, y)
            latitudes.append(lat)
            longitudes.append(lon)
            distances_from_center.append(distance_from_center)
            xs.append(x)
            ys.append(y)

print(len(latitudes), 'candidate neighborhood centers generated.')

364 candidate neighborhood centers generated.


Here, we visualize the cell centroids (or neighborhood centers) just generated.

In [14]:
import folium

In [15]:
map_indy = folium.Map(location=indy_center, zoom_start=13)
folium.Marker(indy_center, popup='Monument').add_to(map_indy)
for lat, lon in zip(latitudes, longitudes):
    folium.Circle([lat, lon], radius=300, color='blue', fill=False).add_to(map_indy)
map_indy

We now have the coordinates of the centers of the neighborhoods/areas to be evaluated, equally spaced (the distance from every point to its neighbors is exactly the same) and within ~6 km of the city center.
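The equal-spacing property can be checked numerically. The sketch below rebuilds a hexagonal grid around an arbitrary origin and measures each point's distance to its nearest neighbor; `hex_grid_centers` is a hypothetical helper written for this illustration (it mirrors the grid logic above but is not the notebook's code), and it works in plain planar units rather than projected coordinates.

```python
import math


def hex_grid_centers(center_x, center_y, radius=6000.0, step=600.0):
    """Hexagonal grid points within `radius` of the center (illustrative helper)."""
    row_step = step * math.sqrt(3) / 2          # vertical distance between rows
    n_rows = int(2 * radius / row_step) + 1
    n_cols = int(2 * radius / step) + 1
    y_min = center_y - (n_rows - 1) * row_step / 2
    x_min = center_x - (n_cols - 1) * step / 2
    centers = []
    for i in range(n_rows):
        y = y_min + i * row_step
        x_offset = step / 2 if i % 2 else 0.0   # shift every other row by half a step
        for j in range(n_cols):
            x = x_min + j * step + x_offset
            if math.hypot(x - center_x, y - center_y) <= radius:
                centers.append((x, y))
    return centers


centers = hex_grid_centers(0.0, 0.0)

# Distance from each point to its closest other point
nearest = [
    min(math.hypot(x1 - x2, y1 - y2) for (x2, y2) in centers if (x2, y2) != (x1, y1))
    for (x1, y1) in centers
]
print('Nearest-neighbor spacing ranges from', min(nearest), 'to', max(nearest))
```

If the grid is truly hexagonal, the minimum and maximum nearest-neighbor distances agree with the chosen step up to floating-point error.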

Using Google Maps API, we can get approximate addresses of those locations. After getting these addresses, we check one of them to make sure the code works.

In [16]:
def get_address(api_key, latitude, longitude, verbose=False):
    try:
        url = 'https://maps.googleapis.com/maps/api/geocode/json?key={}&latlng={},{}'.format(api_key, latitude, longitude)
        response = requests.get(url).json()
        if verbose:
            print('Google Maps API JSON result =>', response)
        results = response['results']
        address = results[0]['formatted_address']
        return address
    except:
        return None

addr = get_address(google_api_key, indy_center[0], indy_center[1])
print('Reverse geocoding check')
print('-----------------------')
print('Address of [{}, {}] is: {}'.format(indy_center[0], indy_center[1], addr))

Reverse geocoding check
-----------------------
Address of [39.767884, -86.15729139999999] is: 49 Monument Cir, Indianapolis, IN 46204, USA


We now obtain the addresses of these cell centroids. We remove the country name (USA) from each address, since it is redundant.

In [17]:
print('Obtaining location addresses: ', end='')
addresses = []
for lat, lon in zip(latitudes, longitudes):
    address = get_address(google_api_key, lat, lon)
    if address is None:
        address = 'NO ADDRESS'
    address = address.replace(', USA', '')
    addresses.append(address)
    print(' .', end='')
print(' done.')

Obtaining location addresses:  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . done.


In [18]:
addresses[150:170]

['116 W 21st St, Indianapolis, IN 46202',
 '2249 N Capitol Ave, Indianapolis, IN 46208',
 '261 W 25th St, Indianapolis, IN 46208',
 '2146 Barth Ave, Indianapolis, IN 46203',
 'Beecher St & E. Pleasant Run Pkwy N. Dr., Indianapolis, IN 46203',
 '916 E Minnesota St, Indianapolis, IN 46203',
 '1407 Wright St, Indianapolis, IN 46203',
 '719 Prospect St, Indianapolis, IN 46203',
 '845 Greer St, Indianapolis, IN 46203',
 '586-630 S East St, Indianapolis, IN 46225, Indianapolis, IN 46203',
 '331 Virginia Ave, Indianapolis, IN 46204',
 '1051452, Indianapolis, IN 46204',
 '222 E Market St, Indianapolis, IN 46204',
 '332 N Delaware St, Indianapolis, IN 46204',
 '605 N Pennsylvania St, Indianapolis, IN 46204',
 '842 N Meridian St, Indianapolis, IN 46204',
 '1191 N Illinois St, Indianapolis, IN 46204',
 '131 W 14th St, Indianapolis, IN 46202',
 '1604 N Capitol Ave, Indianapolis, IN 46202',
 '1901 N Senate Ave, Indianapolis, IN 46202']

A pandas dataframe is created to store all the information we just obtained.

In [19]:
import pandas as pd

df_locations = pd.DataFrame({'Address': addresses,
                             'Latitude': latitudes,
                             'Longitude': longitudes,
                             'X': xs,
                             'Y': ys,
                             'Distance from center': distances_from_center})

df_locations.head(10)

| | Address | Latitude | Longitude | X | Y | Distance from center |
|---|---|---|---|---|---|---|
| 0 | 37 S Gray St, Indianapolis, IN 46201 | 39.767598 | -86.111249 | -5768655.0 | 11447210.0 | 5992.495307 |
| 1 | 231 N Oakland Ave, Indianapolis, IN 46201 | 39.771 | -86.112595 | -5768055.0 | 11447210.0 | 5840.3767 |
| 2 | 2919 E Michigan St, Indianapolis, IN 46201 | 39.774402 | -86.113942 | -5767455.0 | 11447210.0 | 5747.173218 |
| 3 | 653 N Oxford St, Indianapolis, IN 46201 | 39.777805 | -86.115289 | -5766855.0 | 11447210.0 | 5715.767665 |
| 4 | 2801 E 10th St, Indianapolis, IN 46201 | 39.781208 | -86.116637 | -5766255.0 | 11447210.0 | 5747.173218 |
| 5 | 1233 N Temple Ave, Indianapolis, IN 46201 | 39.784611 | -86.117984 | -5765655.0 | 11447210.0 | 5840.3767 |
| 6 | 2506 E 16th St, Indianapolis, IN 46201 | 39.788015 | -86.119332 | -5765055.0 | 11447210.0 | 5992.495307 |
| 7 | 402 S Oakland Ave, Indianapolis, IN 46201 | 39.761595 | -86.113046 | -5769555.0 | 11447730.0 | 5855.766389 |
| 8 | 2928 Newton Ave, Indianapolis, IN 46201 | 39.764997 | -86.114393 | -5768955.0 | 11447730.0 | 5604.462508 |
| 9 | 2817 E Washington St, Indianapolis, IN 46201 | 39.768399 | -86.11574 | -5768355.0 | 11447730.0 | 5408.326913 |


Now we save this data to a local file (locations.pkl).

In [20]:
df_locations.to_pickle('./locations.pkl')   

## Foursquare

Now that we have our location candidates, let's use the Foursquare API to get information on restaurants in each neighborhood.

We're interested in venues in the 'food' category, but only those that are proper restaurants; coffee shops, pizza places, bakeries, etc. are not direct competitors, so we don't consider them. We will therefore include in our list only venues that have 'restaurant' in the category name, and we'll make sure to detect and include all the subcategories of the specific 'Chinese restaurant' category, as we need information on Chinese restaurants in each neighborhood.

In [21]:
CLIENT_ID = 'ZNNYVMI0X4PX4A1IFNSOLSSB2MLFHVCRVHNZMIO3HEPE1EES' # your Foursquare ID
CLIENT_SECRET = '5N3IR0Z445RJZ1QSNKWQMAJBSG5QHUQBO3XX1R5OXENBRHTG' # your Foursquare Secret
ACCESS_TOKEN = 'CW3T203FHEBNWWLQNJRRIUI1XWHGWFTDQXYZYCM5RPM43L3K' # your FourSquare Access Token
VERSION = '20210102'
LIMIT = 30
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: ZNNYVMI0X4PX4A1IFNSOLSSB2MLFHVCRVHNZMIO3HEPE1EES
CLIENT_SECRET:5N3IR0Z445RJZ1QSNKWQMAJBSG5QHUQBO3XX1R5OXENBRHTG


In [22]:
food_category = '4d4b7105d754a06374d81259'

chinese_restaurant_categories = ['4bf58dd8d48988d145941735','52af3a5e3cf9994f4e043bea','52af3a723cf9994f4e043bec',
                                 '52af3a7c3cf9994f4e043bed','58daa1558bbb0b01f18ec1d3','52af3a673cf9994f4e043beb',
                                 '52af3a903cf9994f4e043bee','4bf58dd8d48988d1f5931735','52af3a9f3cf9994f4e043bef',
                                 '52af3aaa3cf9994f4e043bf0','52af3ab53cf9994f4e043bf1','52af3abe3cf9994f4e043bf2',
                                 '52af3ac83cf9994f4e043bf3','52af3ad23cf9994f4e043bf4','52af3add3cf9994f4e043bf5',
                                 '52af3add3cf9994f4e043bf7','52af3add3cf9994f4e043bf6','52af3add3cf9994f4e043bf8',
                                 '52af3add3cf9994f4e043bf9','52af3b213cf9994f4e043bfa','52af3b293cf9994f4e043bfb',
                                 '52af3b343cf9994f4e043bfc','52af3b3b3cf9994f4e043bfd','52af3b463cf9994f4e043bfe',
                                 '52af3b633cf9994f4e043c01','52af3b513cf9994f4e043bff','52af3b593cf9994f4e043c00',
                                 '52af3b6e3cf9994f4e043c02','52af3b773cf9994f4e043c03','52af3b813cf9994f4e043c04',
                                 '52af3b893cf9994f4e043c05','52af3b913cf9994f4e043c06','52af3b9a3cf9994f4e043c07',
                                 '52af3ba23cf9994f4e043c08']


def is_restaurant(categories, specific_filter=None):
    restaurant_words = ['restaurant', 'diner', 'taverna', 'steakhouse']
    restaurant = False
    specific = False
    for c in categories:
        category_name = c[0].lower()
        category_id = c[1]
        for r in restaurant_words:
            if r in category_name:
                restaurant = True
        if 'fast food' in category_name:
            restaurant = False
        if not(specific_filter is None) and (category_id in specific_filter):
            specific = True
            restaurant = True
    return restaurant, specific

def get_categories(categories):
    return [(cat['name'], cat['id']) for cat in categories]

def format_address(location):
    address = ', '.join(location['formattedAddress'])
    address = address.replace(', Deutschland', '')
    address = address.replace(', Germany', '')
    return address

def get_venues_near_location(lat, lon, category, CLIENT_ID, CLIENT_SECRET, radius=500, limit=100):
    # The API version date comes from the module-level VERSION constant defined above
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
        CLIENT_ID, CLIENT_SECRET, VERSION, lat, lon, category, radius, limit)
    try:
        results = requests.get(url).json()['response']['groups'][0]['items']
        venues = [(item['venue']['id'],
                   item['venue']['name'],
                   get_categories(item['venue']['categories']),
                   (item['venue']['location']['lat'], item['venue']['location']['lng']),
                   format_address(item['venue']['location']),
                   item['venue']['location']['distance']) for item in results]        
    except:
        venues = []
    return venues

In [23]:
import pickle

def get_restaurants(lats, lons):
    restaurants = {}
    chinese_restaurants = {}
    location_restaurants = []

    print('Obtaining venues around candidate locations:', end='')
    for lat, lon in zip(lats, lons):
        # Using radius=350 to make sure we have overlaps/full coverage so we don't miss any restaurant (we're using dictionaries to remove any duplicates resulting from area overlaps)
        venues = get_venues_near_location(lat, lon, food_category, CLIENT_ID, CLIENT_SECRET, radius=350, limit=100)
        area_restaurants = []
        for venue in venues:
            venue_id = venue[0]
            venue_name = venue[1]
            venue_categories = venue[2]
            venue_latlon = venue[3]
            venue_address = venue[4]
            venue_distance = venue[5]
            is_res, is_chinese = is_restaurant(venue_categories, specific_filter=chinese_restaurant_categories)
            if is_res:
                x, y = lonlat_to_xy(venue_latlon[1], venue_latlon[0])
                restaurant = (venue_id, venue_name, venue_latlon[0], venue_latlon[1], venue_address, venue_distance, is_chinese, x, y)
                if venue_distance<=400:
                    area_restaurants.append(restaurant)
                restaurants[venue_id] = restaurant
                if is_chinese:
                    chinese_restaurants[venue_id] = restaurant
        location_restaurants.append(area_restaurants)
        print(' .', end='')
    print(' done.')
    return restaurants, chinese_restaurants, location_restaurants

# Try to load from local file system in case we did this before
restaurants = {}
chinese_restaurants = {}
location_restaurants = []
loaded = False
try:
    with open('restaurants_350.pkl', 'rb') as f:
        restaurants = pickle.load(f)
    with open('chinese_restaurants_350.pkl', 'rb') as f:
        chinese_restaurants = pickle.load(f)
    with open('location_restaurants_350.pkl', 'rb') as f:
        location_restaurants = pickle.load(f)
    print('Restaurant data loaded.')
    loaded = True
except:
    pass

# If load failed use the Foursquare API to get the data
if not loaded:
    restaurants, chinese_restaurants, location_restaurants = get_restaurants(latitudes, longitudes)
    
    # Let's persists this in local file system
    with open('restaurants_350.pkl', 'wb') as f:
        pickle.dump(restaurants, f)
    with open('chinese_restaurants_350.pkl', 'wb') as f:
        pickle.dump(chinese_restaurants, f)
    with open('location_restaurants_350.pkl', 'wb') as f:
        pickle.dump(location_restaurants, f)

Restaurant data loaded.


In [24]:
import numpy as np

print('Total number of restaurants:', len(restaurants))
print('Total number of Chinese restaurants:', len(chinese_restaurants))
print('Percentage of Chinese restaurants: {:.2f}%'.format(len(chinese_restaurants) / len(restaurants) * 100))
print('Average number of restaurants in neighborhood:', np.array([len(r) for r in location_restaurants]).mean())

Total number of restaurants: 195
Total number of Chinese restaurants: 5
Percentage of Chinese restaurants: 2.56%
Average number of restaurants in neighborhood: 1.0137362637362637


In [25]:
print('List of all restaurants')
print('-----------------------')
for r in list(restaurants.values())[:10]:
    print(r)
print('...')
print('Total:', len(restaurants))

List of all restaurants
-----------------------
('4d41b4b989616dcbc8da11b5', 'Mexican Inn', 39.77443917849138, -86.11773216303911, '2639 E Michigan St (Rural), Indianapolis, IN 46201, United States', 298, False, -5767305.218703572, 11447680.39701945)
('4f32398a19836c91c7c2753c', 'Mexican in', 39.774291, -86.117829, '2639 E Michigan St, Indianapolis, IN 46201, United States', 284, False, -5767325.434018949, 11447699.750650108)
('4e6ccfb2b993061ea8f88013', 'Birrieria Ocotlan', 39.760660597303016, -86.11281898571718, 'Indianapolis, IN, United States', 342, False, -5769714.584651757, 11447747.901628207)
('4fae9f05e4b08a88c5dac898', 'Octolan', 39.760195, -86.116603, 'Indianapolis, IN 46201, United States', 257, False, -5769645.728049673, 11448241.986343428)
('4dd5e653fa76ad96d0fff981', 'Taco Stand', 39.76583333333333, -86.11749999999999, 'Indianapolis, IN 46201, United States', 281, False, -5768702.079484412, 11448075.681740649)
('51a62d0f498ea14078608597', 'Tlaolli', 39.76868964405419, -86

In [26]:
print('List of Chinese restaurants')
print('---------------------------')
for r in list(chinese_restaurants.values())[:10]:
    print(r)
print('...')
print('Total:', len(chinese_restaurants))

List of Chinese restaurants
---------------------------
('4bdf6323ffdec92834ddeba1', "General Tso's Inn", 39.75825602108911, -86.11430403622245, '642 Twin Aire Dr (Southeastern), Indianapolis, IN 46203, United States', 181, True, -5770046.008648187, 11448051.337766806)
('4c01a236f8492d7ff0325ffa', 'Hong Kong Restaurant', 39.787688342281854, -86.15964141562343, '1524 N Illinois St (at Rankin St), Indianapolis, IN 46202, United States', 236, True, -5763573.393101967, 11452239.519477528)
('4b1457f9f964a5208ca123e3', 'China King', 39.76949863249099, -86.15440001269685, '148 N Delaware St (btw Wabash and Ohio), Indianapolis, IN 46204, United States', 305, True, -5766705.252673871, 11452486.756355537)
('4b4a3be1f964a520f67f26e3', "P.F. Chang's", 39.7666047, -86.1595627, '49 W Maryland St Ste 226 (at Circle Centre Mall), Indianapolis, IN 46204, United States', 138, True, -5766974.796359616, 11453272.153657021)
('4ca218d5542b224b670e14a0', 'Panda Express', 39.781281192790146, -86.1703093524892

In [28]:
print('Restaurants around location')
print('---------------------------')
for i in range(100, 110):
    rs = location_restaurants[i][:8]
    names = ', '.join([r[1] for r in rs])
    print('Restaurants around location {}: {}'.format(i+1, names))

Restaurants around location
---------------------------
Restaurants around location 101: YO! Sushi
Restaurants around location 102: Geraldine’s Supper Club & Lounge
Restaurants around location 103: 
Restaurants around location 104: Second Helpings, Moto Cafe, Sanitary Diner
Restaurants around location 105: Ralph's Great Divide, Sanitary Diner
Restaurants around location 106: Ralph's Great Divide, H R H
Restaurants around location 107: livery, Mesh, thaitanium, Forty Five Degrees
Restaurants around location 108: Yats, Mimi Blue Meatballs, Sultana Cafe & Hookah Bar, Forty Five Degrees, Love Handle
Restaurants around location 109: Sultana Cafe & Hookah Bar
Restaurants around location 110: Tinker Street


Let's now see all the collected restaurants in our area of interest on a map, with Chinese restaurants shown in a different color. As we already know, there are only 5 Chinese restaurants in this area (in red).

In [29]:
map_indy = folium.Map(location=indy_center, zoom_start=13)
folium.Marker(indy_center, popup='Monument').add_to(map_indy)
for res in restaurants.values():
    lat = res[2]; lon = res[3]
    is_chinese = res[6]
    color = 'red' if is_chinese else 'blue'
    folium.CircleMarker([lat, lon], radius=3, color=color, fill=True, fill_color=color, fill_opacity=1).add_to(map_indy)
map_indy

Looking good. So now we have all the restaurants within a few kilometers of the city center, and we know which ones are Chinese restaurants. We also know exactly which restaurants are in the vicinity of every neighborhood candidate center.

This concludes the data gathering phase - we're now ready to use this data for analysis to produce the report on optimal locations for a new Chinese restaurant! We will look for a location that is far from any Chinese restaurant and within a cluster of other restaurants, as people often go to an area with a good selection of restaurants.

## Methodology
In this project we will direct our efforts on detecting areas of Indianapolis that are far from existing Chinese restaurants while sitting within a cluster of other restaurants. We will limit our analysis to the area within ~6 km of the city center.

In the first step we have collected the required **data: location and type (category) of every restaurant within 6 km of Indy center**. We have also **identified Chinese restaurants** (according to Foursquare categorization).

In the second step we will identify clusters using the **k-means** algorithm. We will then identify an ideal location for a new Chinese restaurant that is far from other Chinese restaurants and at the same time within one of the restaurant clusters. We will present a map of all such locations, which should be a good starting point for the final "street level" exploration to be performed by stakeholders after this analysis.
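For readers unfamiliar with k-means, the following dependency-free sketch shows the two alternating steps of the algorithm (assign each point to its nearest centroid, then move each centroid to the mean of its points). It is for illustration only and is not the implementation used in this notebook, which relies on scikit-learn's `KMeans`; the function name `kmeans`, the evenly spaced initial centroids (scikit-learn defaults to k-means++), and the toy points are all assumptions.

```python
import math


def kmeans(points, k, iterations=100):
    """Plain Lloyd's algorithm on a list of coordinate tuples (illustrative only)."""
    # Simplistic deterministic initialization: evenly spaced points from the input
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids


# Toy data: two well-separated groups of made-up points
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_labels, toy_centroids = kmeans(toy_points, k=2)
print(toy_labels)
```

Because the result depends on the initial centroids, library implementations typically use a smarter initialization and a fixed `random_state` for reproducibility.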

## Analysis

First, we perform some exploratory data analysis, for example counting the number of restaurants in every candidate area.


In [30]:
location_restaurants_count = [len(res) for res in location_restaurants]

df_locations['Restaurants in area'] = location_restaurants_count

print('Average number of restaurants in every area with radius=300m:', np.array(location_restaurants_count).mean())

df_locations.head(10)

Average number of restaurants in every area with radius=300m: 1.0137362637362637


| | Address | Latitude | Longitude | X | Y | Distance from center | Restaurants in area |
|---|---|---|---|---|---|---|---|
| 0 | 37 S Gray St, Indianapolis, IN 46201 | 39.767598 | -86.111249 | -5768655.0 | 11447210.0 | 5992.495307 | 0 |
| 1 | 231 N Oakland Ave, Indianapolis, IN 46201 | 39.771 | -86.112595 | -5768055.0 | 11447210.0 | 5840.3767 | 0 |
| 2 | 2919 E Michigan St, Indianapolis, IN 46201 | 39.774402 | -86.113942 | -5767455.0 | 11447210.0 | 5747.173218 | 0 |
| 3 | 653 N Oxford St, Indianapolis, IN 46201 | 39.777805 | -86.115289 | -5766855.0 | 11447210.0 | 5715.767665 | 0 |
| 4 | 2801 E 10th St, Indianapolis, IN 46201 | 39.781208 | -86.116637 | -5766255.0 | 11447210.0 | 5747.173218 | 0 |
| 5 | 1233 N Temple Ave, Indianapolis, IN 46201 | 39.784611 | -86.117984 | -5765655.0 | 11447210.0 | 5840.3767 | 0 |
| 6 | 2506 E 16th St, Indianapolis, IN 46201 | 39.788015 | -86.119332 | -5765055.0 | 11447210.0 | 5992.495307 | 0 |
| 7 | 402 S Oakland Ave, Indianapolis, IN 46201 | 39.761595 | -86.113046 | -5769555.0 | 11447730.0 | 5855.766389 | 1 |
| 8 | 2928 Newton Ave, Indianapolis, IN 46201 | 39.764997 | -86.114393 | -5768955.0 | 11447730.0 | 5604.462508 | 1 |
| 9 | 2817 E Washington St, Indianapolis, IN 46201 | 39.768399 | -86.11574 | -5768355.0 | 11447730.0 | 5408.326913 | 1 |


Now we calculate the **distance to nearest Chinese restaurant from every area candidate center** (we want distance to closest one, regardless of how distant it is).

In [31]:
distances_to_chinese_restaurant = []

for area_x, area_y in zip(xs, ys):
    min_distance = float('inf')  # no Chinese restaurant seen yet
    for res in chinese_restaurants.values():
        res_x = res[7]
        res_y = res[8]
        d = calc_xy_distance(area_x, area_y, res_x, res_y)
        if d<min_distance:
            min_distance = d
    distances_to_chinese_restaurant.append(min_distance)

df_locations['Distance to Chinese restaurant'] = distances_to_chinese_restaurant

In [32]:
df_locations.head(10)

| | Address | Latitude | Longitude | X | Y | Distance from center | Restaurants in area | Distance to Chinese restaurant |
|---|---|---|---|---|---|---|---|---|
| 0 | 37 S Gray St, Indianapolis, IN 46201 | 39.767598 | -86.111249 | -5768655.0 | 11447210.0 | 5992.495307 | 0 | 1625.17173 |
| 1 | 231 N Oakland Ave, Indianapolis, IN 46201 | 39.771 | -86.112595 | -5768055.0 | 11447210.0 | 5840.3767 | 0 | 2161.043133 |
| 2 | 2919 E Michigan St, Indianapolis, IN 46201 | 39.774402 | -86.113942 | -5767455.0 | 11447210.0 | 5747.173218 | 0 | 2723.789951 |
| 3 | 653 N Oxford St, Indianapolis, IN 46201 | 39.777805 | -86.115289 | -5766855.0 | 11447210.0 | 5715.767665 | 0 | 3299.690284 |
| 4 | 2801 E 10th St, Indianapolis, IN 46201 | 39.781208 | -86.116637 | -5766255.0 | 11447210.0 | 5747.173218 | 0 | 3882.895857 |
| 5 | 1233 N Temple Ave, Indianapolis, IN 46201 | 39.784611 | -86.117984 | -5765655.0 | 11447210.0 | 5840.3767 | 0 | 4470.548569 |
| 6 | 2506 E 16th St, Indianapolis, IN 46201 | 39.788015 | -86.119332 | -5765055.0 | 11447210.0 | 5992.495307 | 0 | 5061.099563 |
| 7 | 402 S Oakland Ave, Indianapolis, IN 46201 | 39.761595 | -86.113046 | -5769555.0 | 11447730.0 | 5855.766389 | 1 | 586.526896 |
| 8 | 2928 Newton Ave, Indianapolis, IN 46201 | 39.764997 | -86.114393 | -5768955.0 | 11447730.0 | 5604.462508 | 1 | 1137.074347 |
| 9 | 2817 E Washington St, Indianapolis, IN 46201 | 39.768399 | -86.11574 | -5768355.0 | 11447730.0 | 5408.326913 | 1 | 1721.006201 |


In [33]:
print('Average distance to closest Chinese restaurant from each area center:', df_locations['Distance to Chinese restaurant'].mean())

Average distance to closest Chinese restaurant from each area center: 2705.2949736238998


OK, so **on average a Chinese restaurant can be found within ~2700 m** of every candidate area center. That is fairly far, which makes our job a little easier.

In [1]:
!pip install beautifulsoup4
!pip install lxml

import requests 
import pandas as pd 
import numpy as np 
import random



In [2]:
!pip install geopy

from geopy.geocoders import Nominatim

from IPython.display import Image 
from IPython.core.display import HTML 

from IPython.display import display_html
import pandas as pd
import numpy as np

from pandas.io.json import json_normalize

!pip install folium
import folium 
from bs4 import BeautifulSoup
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors



In [45]:
map_indy2 = folium.Map(location=[39.767884, -86.15729139999999],zoom_start=10)

for lat, lng, address in zip(df_locations['Latitude'], df_locations['Longitude'], df_locations['Address']):
    # Show each candidate center with its address as a popup
    label = folium.Popup('{}'.format(address), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_indy2)
map_indy2

In [46]:
# set number of clusters
kclusters = 5

# cluster on the geographic coordinates only
indy_clustering = df_locations[['Latitude', 'Longitude']]

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(indy_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([2, 2, 2, 2, 3, 3, 3, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2,
       2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3,
       2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 1, 1,
       1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 1, 1, 1,
       1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4, 1, 1, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4, 4, 4, 4,

In [51]:
# create map
map_indy3 = folium.Map(location=indy_center, zoom_start=13)
folium.Marker(indy_center, popup='Monument').add_to(map_indy3)
for res in restaurants.values():
    lat = res[2]; lon = res[3]
    is_chinese = res[6]
    color = 'yellow' if is_chinese else 'blue'
    folium.CircleMarker([lat, lon], radius=3, color=color, fill=True, fill_color=color, fill_opacity=1).add_to(map_indy3)

# store the cluster label of every candidate area under the name used below
df_locations.insert(0, 'Cluster Labels', kmeans.labels_)

# set color scheme for the clusters (one color per cluster)
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(c) for c in colors_array]

# add markers to the map
for lat, lon, address, cluster in zip(df_locations['Latitude'], df_locations['Longitude'], df_locations['Address'], df_locations['Cluster Labels']):
    label = folium.Popup(str(address) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_indy3)
       
map_indy3

## Conclusion

The purpose of this project is to find an area in Indianapolis to open a Chinese restaurant.

After fetching data from several sources and processing it into a clean data frame, we applied the k-means clustering algorithm and picked the cluster with more restaurants and fewer Chinese restaurants on average. By sorting all candidate areas in that cluster, we obtain the 5 most promising zones, which serve as starting points for final exploration by stakeholders.
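The selection step described above could be sketched as follows. This is one possible interpretation under stated assumptions, not the notebook's actual code: `top_candidates` is a hypothetical helper, "fewer Chinese restaurants" is approximated by a larger mean distance to the nearest Chinese restaurant, the dictionary keys are illustrative stand-ins for the data frame columns, and the sample rows are made-up toy values.

```python
from collections import defaultdict
from statistics import mean


def top_candidates(areas, n=5):
    """Pick the best cluster, then rank its areas (hypothetical selection rule)."""
    by_cluster = defaultdict(list)
    for area in areas:
        by_cluster[area['cluster']].append(area)

    def cluster_score(cluster):
        members = by_cluster[cluster]
        # Prefer many restaurants on average, then a large distance to Chinese restaurants
        return (mean(m['restaurants'] for m in members),
                mean(m['dist_chinese'] for m in members))

    best = max(by_cluster, key=cluster_score)
    ranked = sorted(by_cluster[best],
                    key=lambda a: (a['restaurants'], a['dist_chinese']),
                    reverse=True)
    return best, ranked[:n]


# Toy input with made-up values, standing in for rows of the candidate table
toy_areas = [
    {'address': 'A', 'cluster': 0, 'restaurants': 0, 'dist_chinese': 3000.0},
    {'address': 'B', 'cluster': 0, 'restaurants': 1, 'dist_chinese': 2500.0},
    {'address': 'C', 'cluster': 1, 'restaurants': 4, 'dist_chinese': 900.0},
    {'address': 'D', 'cluster': 1, 'restaurants': 2, 'dist_chinese': 1500.0},
    {'address': 'E', 'cluster': 1, 'restaurants': 4, 'dist_chinese': 1200.0},
]
best_cluster, best_areas = top_candidates(toy_areas, n=2)
print(best_cluster, [a['address'] for a in best_areas])
```

Other ranking rules, for example weighting the two criteria differently, are equally plausible and would change which zones come out on top.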

The final decision on the optimal location for the Chinese restaurant will be made by stakeholders based on the specific characteristics of neighborhoods and locations in every recommended zone, taking into consideration additional factors such as parking at each location and the traffic and current revenue of existing Chinese restaurants in the cluster.