# Table of contents
* [Business problem](#h1)
* [Preparation](#h2)
* [Loading list of Yoga Studios](#h3)
* [Analyzing next venues](#h4)
* [Loading venues from TOP-3 next categories](#h5)
* [Generation of points to check](#h6)
* [Clusterization](#h7)
* [Conclusion](#h8)

## Business problem  <a class="anchor" id="h1"></a>

We want to open a new Yoga Studio in San Francisco, USA. A suitable location should meet the following criteria:
* Be at least 500 meters away from any existing Yoga Studio
* Be inside a given area on the map
* Be near the kinds of venues that people commonly visit after yoga


The target area is defined by a GeoJSON file and looks like this:

In [1]:
import folium

map_sf = folium.Map(location=[37.809, -122.436], zoom_start=12)
folium.GeoJson('./sf.geojson', name='Interested area').add_to(map_sf)
map_sf

## Preparation <a class="anchor" id="h2"></a>
Let's load the Foursquare API credentials from a config file.

An external config file is preferable to writing credentials directly in the code for security reasons: credentials should not be public, especially in GitHub-hosted projects. This way the credentials are held in kernel memory at runtime but are never stored in the notebook code itself.

In [2]:
import json
import requests
import os.path

with open('./data/foursquare_credentials.json') as f:
    data = json.load(f)
    CLIENT_ID = data['client_id']
    CLIENT_SECRET = data['client_secret']
    VERSION = '20180604'
    print('Credentials loaded')

Credentials loaded
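For reference, the config file only needs the two keys read above, `client_id` and `client_secret`. The following is a minimal sketch of a loader that fails early when one of them is missing; the helper name `load_credentials` and the placeholder values are assumptions made for illustration and are not part of the notebook.

```python
import json
import os
import tempfile


def load_credentials(path):
    """Read client_id and client_secret from a JSON file, failing early if one is missing."""
    with open(path) as f:
        data = json.load(f)
    missing = [key for key in ('client_id', 'client_secret') if not data.get(key)]
    if missing:
        raise KeyError('missing credential fields: {}'.format(', '.join(missing)))
    return data['client_id'], data['client_secret']


# Demo with a throwaway file; the values are placeholders, not real credentials.
demo_path = os.path.join(tempfile.mkdtemp(), 'foursquare_credentials.json')
with open(demo_path, 'w') as f:
    json.dump({'client_id': 'YOUR_CLIENT_ID', 'client_secret': 'YOUR_CLIENT_SECRET'}, f)

client_id, client_secret = load_credentials(demo_path)
print('Credentials loaded')
```

Keeping the real file out of version control (for example through `.gitignore`) is what actually keeps the credentials private.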


Let's define the function **loadOrGetCached**, which loads a cached server response from disk when one is available, because the Foursquare API has daily request quotas. If no cache is available, the function performs the network request and saves the response to the cache for future use.

We also define the functions **urlSearch**, **urlDetails** and **urlNextVenues**, which build the URLs for the corresponding Foursquare API calls.

In [3]:
def loadOrGetCached(url, fileName):        
    if os.path.isfile(fileName):   
        with open(fileName) as f:
            results = json.load(f)
    else:
        dirName = os.path.dirname(fileName)
        
        if not os.path.exists(dirName):
            os.makedirs(dirName)
        
        response = requests.get(url)
        if response.ok:        
            results = response.json()
            with open(fileName, 'w') as f:
                json.dump(results, f)
        else:
            print('error: {} at loading {}'.format(response.status_code, url))
            return None
        
    return results

def urlSearch(latitude, longitude, categoryId):
    return 'https://api.foursquare.com/v2/venues/search?' \
                'client_id={}&client_secret={}&v={}' \
                '&ll={},{}&categoryId={}&limit=50' \
                .format(CLIENT_ID, CLIENT_SECRET, VERSION, latitude, longitude, categoryId)

def urlDetails(venueID):
    return 'https://api.foursquare.com/v2/venues/{}?' \
           'client_id={}&client_secret={}&v={}'\
        .format(venueID, CLIENT_ID, CLIENT_SECRET, VERSION)

def urlNextVenues(venueID):
    return 'https://api.foursquare.com/v2/venues/{}/nextvenues?' \
           'client_id={}&client_secret={}&v={}'\
        .format(venueID, CLIENT_ID, CLIENT_SECRET, VERSION)

print('Urls defined')

Urls defined
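The caching behaviour can be checked without touching the network. The sketch below reimplements the same idea with a hypothetical `load_or_fetch` helper that takes a callable instead of a URL, and a `fake_fetch` stand-in for the HTTP request; both names are assumptions for illustration only.

```python
import json
import os
import tempfile


def load_or_fetch(fetch, file_name):
    """Return cached JSON from file_name, or call fetch() once and cache its result."""
    if os.path.isfile(file_name):
        with open(file_name) as f:
            return json.load(f)

    result = fetch()
    if result is None:
        return None

    # exist_ok=True avoids a separate existence check for the cache folder
    os.makedirs(os.path.dirname(file_name) or '.', exist_ok=True)
    with open(file_name, 'w') as f:
        json.dump(result, f)
    return result


calls = []


def fake_fetch():
    # Hypothetical stand-in for requests.get(url).json()
    calls.append(1)
    return {'response': {'venues': []}}


cache_file = os.path.join(tempfile.mkdtemp(), 'list', 'yoga', 'demo.json')

first = load_or_fetch(fake_fetch, cache_file)   # performs the "request"
second = load_or_fetch(fake_fetch, cache_file)  # served from disk
print('fetch calls:', len(calls))
```

The second call never reaches `fake_fetch`, which is the property that protects the daily quota when the notebook is re-run.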


Because the Foursquare venue search returns at most 50 matching venues per request (see `limit=50` above) within some radius around the specified point, a single request may not return every venue in the city, especially if the city is large and/or has more than 50 venues of the given category. So we split the city area into rectangular cells and perform one request per cell.

For this task we will split the area into 2 'rows' and 3 'columns'.

In [4]:
import pandas as pd
import numpy as np
from tqdm import tqdm

tqdm.pandas()

boundaries = np.array([[37.809, -122.536], [37.708, -122.349]])

numSquaresLat = 2
numSquaresLon = 3

latPoints = np.linspace(boundaries[0,0], boundaries[1,0], numSquaresLat+1)
lonPoints = np.linspace(boundaries[0,1], boundaries[1,1], numSquaresLon+1)
print(latPoints)
print(lonPoints)

map_sf = folium.Map(location=boundaries.mean(axis=0), zoom_start=12)

folium.Rectangle(bounds=boundaries, 
                 fill = True,
                 fill_color='blue',
                 fill_opacity=0).add_to(map_sf)

areasCenters = []

for x in range(numSquaresLat):
    for y in range(numSquaresLon):
        folium.Rectangle(bounds=([latPoints[x], lonPoints[y]], [latPoints[x+1], lonPoints[y+1]]), 
                 fill = True,
                 fill_color='yellow',
                 fill_opacity=0.1).add_to(map_sf)
        areasCenters.append([np.mean([latPoints[x], latPoints[x+1]]), np.mean([lonPoints[y], lonPoints[y+1]])])

map_sf


[37.809  37.7585 37.708 ]
[-122.536      -122.47366667 -122.41133333 -122.349     ]
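The centre-point calculation can be looked at in isolation from the map drawing. Below is a minimal, dependency-free sketch using the same boundaries; the helper names `linspace` and `cell_centers` are hypothetical and only mirror what `np.linspace` and the nested loop do above.

```python
def linspace(start, stop, num):
    """Evenly spaced values from start to stop inclusive (num >= 2)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def cell_centers(north_west, south_east, rows, cols):
    """Centre (lat, lon) of every cell in a rows x cols grid over the bounding box."""
    lat_edges = linspace(north_west[0], south_east[0], rows + 1)
    lon_edges = linspace(north_west[1], south_east[1], cols + 1)
    return [
        ((lat_edges[r] + lat_edges[r + 1]) / 2, (lon_edges[c] + lon_edges[c + 1]) / 2)
        for r in range(rows)
        for c in range(cols)
    ]


centers = cell_centers((37.809, -122.536), (37.708, -122.349), rows=2, cols=3)
for lat, lon in centers:
    print('{:.4f}, {:.4f}'.format(lat, lon))
```

Each of the six centres becomes the `ll` parameter of one search request.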


## Loading list of Yoga Studios <a class="anchor" id="h3"></a>
Now let's load all Yoga Studios in San Francisco!

In [5]:
from pandas.io.json import json_normalize

yogaCategoryId = '4bf58dd8d48988d102941735'

dfYogaStudios = None

for area in areasCenters:
    lat = area[0]
    lon = area[1]
    fileName = './data/list/yoga/{:.4f}_{:.4f}.json'.format(lat, lon)
    data = loadOrGetCached(urlSearch(lat, lon, yogaCategoryId), fileName)
    
    df = json_normalize(data['response']['venues'])
    
    if dfYogaStudios is None :
        dfYogaStudios = df
    else:
        dfYogaStudios = dfYogaStudios.append(df, ignore_index=True)
        
print(dfYogaStudios.shape)
print(dfYogaStudios.head())


(300, 19)
                                          categories  hasPerk  \
0  [{'id': '503289d391d4c4b30a586d6a', 'name': 'C...    False   
1  [{'id': '503289d391d4c4b30a586d6a', 'name': 'C...    False   
2  [{'id': '4bf58dd8d48988d102941735', 'name': 'Y...    False   
3  [{'id': '4bf58dd8d48988d102941735', 'name': 'Y...    False   
4  [{'id': '4bf58dd8d48988d176941735', 'name': 'G...    False   

                         id        location.address location.cc  \
0  49d14d76f964a520755b1fe3        924 Old Mason St          US   
1  4690d81af964a520a1481fe3      100 El Camino Real          US   
2  5ba43ed51f7440002cba3098  150 Van Ness Ave Ste A          US   
3  5b6ba7e5a42362002c0899ab     100 Church St Ste A          US   
4  4ab81d6ff964a5203d7c20e3      3200 California St          US   

   location.city location.country location.crossStreet  location.distance  \
0  San Francisco    United States                  NaN               3951   
1        Belmont    United States         

We got 300 Yoga Studios, which looks like too many. Of course, there will be duplicates because the search areas overlap. Let's drop duplicates based on the 'id' field.

In [6]:
dfYogaStudios.drop_duplicates(subset=['id'], keep='first', inplace=True)
print(dfYogaStudios.shape)

dfYogaStudios.drop(columns=['hasPerk','location.cc','location.country','location.crossStreet',
                            'location.distance','location.formattedAddress', 'location.labeledLatLngs',
                            'location.neighborhood', 'location.postalCode', 'location.state', 'location.city',
                            'categories','referralId','venuePage.id'], inplace=True)
print(dfYogaStudios.head())

(50, 19)
                         id        location.address  location.lat  \
0  49d14d76f964a520755b1fe3        924 Old Mason St     37.804408   
1  4690d81af964a520a1481fe3      100 El Camino Real     37.529105   
2  5ba43ed51f7440002cba3098  150 Van Ness Ave Ste A     37.777012   
3  5b6ba7e5a42362002c0899ab     100 Church St Ste A     37.769317   
4  4ab81d6ff964a5203d7c20e3      3200 California St     37.787215   

   location.lng                name  
0   -122.468304      Planet Granite  
1   -122.287567      Planet Granite  
2   -122.419531      CorePower Yoga  
3   -122.429319      CorePower Yoga  
4   -122.447671  JCC Fitness Center  


Ok, just 50 studios left. Much better.
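A note for newer environments: `DataFrame.append`, used in the loading loop above, was removed in pandas 2.0. A minimal sketch of the same combine-then-deduplicate step with `pd.concat` follows; the frames and ids are made up for illustration.

```python
import pandas as pd

# Toy per-area results; the ids and names are made up for illustration.
area_frames = [
    pd.DataFrame({'id': ['a', 'b'], 'name': ['Studio A', 'Studio B']}),
    pd.DataFrame({'id': ['b', 'c'], 'name': ['Studio B', 'Studio C']}),
]

# Concatenate once, then keep the first row seen for each venue id.
combined = pd.concat(area_frames, ignore_index=True)
unique = combined.drop_duplicates(subset=['id'], keep='first').reset_index(drop=True)

print(combined.shape, unique.shape)
```

Collecting the per-area frames in a list and concatenating once also avoids copying the growing frame on every iteration.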

Let's visualize them on the map.

In [7]:
map_sf = folium.Map(location=boundaries.mean(axis=0), zoom_start=11)

folium.Rectangle(bounds=boundaries, 
                 fill = True,
                 fill_color='blue',
                 fill_opacity=0).add_to(map_sf)

for lat, lng, label in zip(dfYogaStudios['location.lat'], dfYogaStudios['location.lng'], dfYogaStudios.name):
    folium.Marker(
        [lat, lng],
        popup=label
    ).add_to(map_sf)

map_sf


Unfortunately, some of the studios are outside the San Francisco area marked by the blue rectangle.

Let's drop all records outside this rectangle.

In [8]:
dfYogaStudios.drop( dfYogaStudios[dfYogaStudios['location.lat'] < boundaries[:,0].min()].index, inplace=True )
dfYogaStudios.drop( dfYogaStudios[dfYogaStudios['location.lat'] > boundaries[:,0].max()].index, inplace=True )
dfYogaStudios.drop( dfYogaStudios[dfYogaStudios['location.lng'] < boundaries[:,1].min()].index, inplace=True )
dfYogaStudios.drop( dfYogaStudios[dfYogaStudios['location.lng'] > boundaries[:,1].max()].index, inplace=True )

print(dfYogaStudios.shape)

(33, 5)


In [9]:
map_sf = folium.Map(location=boundaries.mean(axis=0), zoom_start=11)

folium.Rectangle(bounds=boundaries, 
                 fill = True,
                 fill_color='blue',
                 fill_opacity=0).add_to(map_sf)

for lat, lng, label in zip(dfYogaStudios['location.lat'], dfYogaStudios['location.lng'], dfYogaStudios.name):
    folium.Marker(
        [lat, lng],
        popup=label
    ).add_to(map_sf)

map_sf

Alright, just 33 studios left.

But there is another problem: some of these Yoga Studios don't have addresses! Let's fill them in using the Nominatim service from the geopy library.

I will also use the RateLimiter helper to throttle requests to no more than 1 request per second.

In [10]:
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

geolocator = Nominatim(user_agent="coursera_capstone")
geocode = RateLimiter(geolocator.reverse, min_delay_seconds=1)

def coordsToAddress(row):
    address = row['location.address']
    
    if (pd.isnull(address)):
        coords = '{}, {}'.format(row['location.lat'], row['location.lng'])
        return geocode(coords)[0]
    
    return address

tqdm.pandas()
dfYogaStudios['location.address'] = dfYogaStudios.progress_apply(func=coordsToAddress, axis=1)

100%|██████████| 33/33 [00:04<00:00,  7.15it/s]


Let's view the existing Yoga Studios on the map once more.

I will add a 500 m radius around each of them, plus the target area.

Our task is to find the best places inside this area that lie outside the 500 m zones of the existing studios.

In [11]:
yogaRadius = 500

map_sf = folium.Map(location=[dfYogaStudios['location.lat'].mean(), dfYogaStudios['location.lng'].mean()], zoom_start=13)

for lat, lng, label in zip(dfYogaStudios['location.lat'], 
                                   dfYogaStudios['location.lng'], 
                                   dfYogaStudios.name):
    folium.CircleMarker(
        [lat, lng],
        radius=2,
        color='blue',
        popup=label
    ).add_to(map_sf)

    folium.Circle(
        location=[lat, lng],
        radius=yogaRadius, 
        weight=0,
        fill=True,
        fill_color='blue',
        fill_opacity=0.3
    ).add_to(map_sf)

folium.GeoJson('./sf.geojson', name='Interested area').add_to(map_sf)
    
map_sf


## Analyzing next venues <a class="anchor" id="h4"></a>
For each Yoga Studio, let's load the list of its top-5 next venues.

We also assign each of them a weight: 5 for the first one, 4 for the second and so on, down to weight=1 for the last one.

In [12]:
from geopy.distance import distance

nextVenuesInfo = []

for venueID,lat,lng in zip(dfYogaStudios['id'], dfYogaStudios['location.lat'], dfYogaStudios['location.lng']):
    response = loadOrGetCached(urlNextVenues(venueID),'./data/next_venues/{}.json'.format(venueID))
    venues = response['response']['nextVenues']['items']
        
    index = 0
    for v in venues:        
        v_category = v['categories'][0]['name']
        v_lat = v['location']['lat']
        v_lng = v['location']['lng']
        
        v_distance = distance((lat, lng),(v_lat, v_lng)).m
        nextVenuesInfo.append([venueID, v_category, 5-index, v_lat, v_lng, v_distance])
        index = index + 1
        
dfNextVenues = pd.DataFrame(data=nextVenuesInfo, columns=['yoga_id','category','weight','lat','lng','distance'])
    
print(dfNextVenues.shape)
print(dfNextVenues.head())


(118, 6)
                    yoga_id             category  weight        lat  \
0  49d14d76f964a520755b1fe3  Sporting Goods Shop       5  37.802853   
1  49d14d76f964a520755b1fe3                 Park       4  37.804342   
2  49d14d76f964a520755b1fe3        Grocery Store       3  37.804480   
3  49d14d76f964a520755b1fe3         Burger Joint       2  37.800236   
4  49d14d76f964a520755b1fe3           Taco Place       1  37.800331   

          lng     distance  
0 -122.459306   811.003398  
1 -122.465351   260.192942  
2 -122.432785  3127.970327  
3 -122.439719  2559.675971  
4 -122.440379  2500.609057  
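The rank-based weighting and the cumulative weight per category can be illustrated on a toy example. This is a sketch on made-up data; the `[:TOP_N]` slice is an extra safeguard added here so that a longer list could never produce zero or negative weights.

```python
from collections import defaultdict

# Toy data: ordered next-venue categories for two studios (made up for illustration).
next_venues = {
    'studio_1': ['Coffee Shop', 'Park', 'Grocery Store'],
    'studio_2': ['Grocery Store', 'Coffee Shop'],
}

TOP_N = 5
weight_sum = defaultdict(int)
for studio_id, categories in next_venues.items():
    for rank, category in enumerate(categories[:TOP_N]):
        # 5 for the first venue, 4 for the second, and so on
        weight_sum[category] += TOP_N - rank

ranking = sorted(weight_sum.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```

A category that appears often and early in the lists accumulates the largest total, which is the quantity the notebook reports as `weight_sum`.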


Now let's calculate the most popular categories based on cumulative weight.

In [13]:
dfGrouped = dfNextVenues.groupby('category')\
    .agg({'category':['count'], 'weight':['sum'],'distance':['mean']}) 
dfGrouped.columns = ["_".join(x) for x in dfGrouped.columns.ravel()]

print('\nby weight',dfGrouped.sort_values(by=['weight_sum'], ascending=False))


by weight                          category_count  weight_sum  distance_mean
category                                                          
Coffee Shop                          23          83     193.936451
Grocery Store                        18          60     650.212217
Bakery                               10          31     373.131759
Juice Bar                             8          22     263.143516
Mexican Restaurant                    5          17     150.522163
Farmers Market                        4          13     511.090607
Salad Place                           4          11     576.237240
Café                                  4          11     203.888523
Supermarket                           2           9     258.901073
Burger Joint                          4           8     757.256102
Souvlaki Shop                         2           8     257.784786
Organic Grocery                       2           7     110.510163
American Restaurant                   3           6

The same result as a bar chart:

In [14]:
dfGrouped.sort_values(by=['weight_sum'], ascending=True).plot.barh(
    y='weight_sum', 
    figsize=(7,9),
    title='Most popular next venues'
)

<matplotlib.axes._subplots.AxesSubplot at 0x11c28730>

As we can see, the most popular next category is Coffee Shop.

For our project, let's choose the best places based on the top-3 categories: Coffee Shops, Grocery Stores and Bakeries.

Now let's calculate the mean distance for each of them.

In [15]:
meanCoffee = dfGrouped.loc['Coffee Shop']['distance_mean']
meanGrocery = dfGrouped.loc['Grocery Store']['distance_mean']
meanBakery = dfGrouped.loc['Bakery']['distance_mean']

print('Mean distance to next Coffee Shop', meanCoffee)
print('Mean distance to next Grocery Store', meanGrocery)
print('Mean distance to next Bakery', meanBakery)

Mean distance to next Coffee Shop 193.93645127575232
Mean distance to next Grocery Store 650.2122169876337
Mean distance to next Bakery 373.1317593463449


## Loading venues from TOP-3 next categories <a class="anchor" id="h5"></a>
Let's load the Coffee Shops, Grocery Stores and Bakeries of San Francisco in the same way as we did before for Yoga Studios.

In [16]:
coffeeShopCategoryId = '4bf58dd8d48988d1e0931735'
groceryStoreCategoryId = '4bf58dd8d48988d118951735' 
bakeryCategoryId = '4bf58dd8d48988d16a941735' 

def loadVenuesList(categoryID, folderName):
    dfVenues = None
    
    for area in areasCenters:
        lat = area[0]
        lon = area[1]
        fileName = './data/list/{}/{:.4f}_{:.4f}.json'.format(folderName, lat, lon)
        data = loadOrGetCached(urlSearch(lat, lon, categoryID), fileName)
        
        df = json_normalize(data['response']['venues'])
        
        if dfVenues is None :
            dfVenues = df
        else:
            dfVenues = dfVenues.append(df, ignore_index=True)
        
    dfVenues.drop_duplicates(subset=['id'], keep='first', inplace=True)
    
    dfVenues.drop( dfVenues[dfVenues['location.lat'] < boundaries[:,0].min()].index, inplace=True )
    dfVenues.drop( dfVenues[dfVenues['location.lat'] > boundaries[:,0].max()].index, inplace=True )
    dfVenues.drop( dfVenues[dfVenues['location.lng'] < boundaries[:,1].min()].index, inplace=True )
    dfVenues.drop( dfVenues[dfVenues['location.lng'] > boundaries[:,1].max()].index, inplace=True )
    
    dfVenues.drop(columns=['location.city'], inplace=True)
    
    print(dfVenues.shape)
    
    dfVenues.drop(columns=['hasPerk','location.cc','location.country','location.crossStreet',
                                'location.distance','location.formattedAddress', 'location.labeledLatLngs',
                                'location.neighborhood', 'location.postalCode', 'location.state',
                                'categories','referralId','venuePage.id'], inplace=True)
    return dfVenues

dfCoffeeShops = loadVenuesList(coffeeShopCategoryId, 'coffee_shops')
dfGroceryStores = loadVenuesList(groceryStoreCategoryId, 'grocery_stores')
dfBakeries = loadVenuesList(bakeryCategoryId, 'bakeries')


(43, 18)
(34, 24)
(40, 24)


Let's show them on the map with the mean radii we calculated before.

In [17]:
def drawNextVenues(map, df, radius, color):
    for lat, lng, label in zip(df['location.lat'], 
                                   df['location.lng'], 
                                   df.name):
        folium.CircleMarker(
            [lat, lng],
            radius=2,
            color=color,
            popup=label
        ).add_to(map)
    
        folium.Circle(
            location=[lat, lng],
            radius=radius,
            color=color,
            fill=True,
            fill_color=color,
            weight=0,
            fill_opacity=0.2
        ).add_to(map)

map_sf = folium.Map(location=[dfYogaStudios['location.lat'].mean(), dfYogaStudios['location.lng'].mean()], zoom_start=13)

drawNextVenues(map_sf, dfCoffeeShops, meanCoffee, 'purple')
drawNextVenues(map_sf, dfGroceryStores, meanGrocery, 'green')
drawNextVenues(map_sf, dfBakeries, meanBakery, 'red')

folium.GeoJson('./sf.geojson', name='Interested area').add_to(map_sf)

map_sf


## Generation of points to check <a class="anchor" id="h6"></a>
To check possible locations, we will generate a grid of points with 50 m spacing.

Because the target area is non-rectangular, generation consists of two phases:
1. Generate a grid inside the bounding rectangle (min-max coordinates) of the target area
2. Exclude points outside the GeoJSON area. For this task we will use the *shapely.geometry* module

In [18]:
from shapely.geometry import Polygon, Point

with open('./sf.geojson') as f:
    coordinates = json.load(f)["features"][0]['geometry']['coordinates'][0]    
    polygon = Polygon(coordinates)

c = np.array(coordinates)
minLon = c[:,0].min()
maxLon = c[:,0].max()
minLat = c[:,1].min()
maxLat = c[:,1].max()

distanceLat = distance((minLat, minLon),(maxLat, minLon)).m
distanceLon = distance((minLat, minLon),(minLat, maxLon)).m

INTERVAL = 50

stepsLat = round(distanceLat / INTERVAL)
stepsLon = round(distanceLon / INTERVAL)

latPoints = np.linspace(minLat, maxLat, stepsLat)
lonPoints = np.linspace(minLon, maxLon, stepsLon)

pointsToCheck = []
for lat in latPoints:
    for lon in lonPoints:
        pointsToCheck.append((lat, lon))

print('points in external rectangle', len(pointsToCheck))

points = []
for point in pointsToCheck:
    lon = point[1]
    lat = point[0]
    if polygon.contains(Point(lon, lat)):
        points.append((lat, lon))
        
print('points inside area', len(points))


points in external rectangle 8820
points inside area 5755
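The two phases can be reproduced on a toy polygon without any geometry library. Note that this sketch swaps in a hand-written ray-casting test in place of shapely's `Polygon.contains`, and the triangle and grid spacing are made up; it only illustrates the generate-then-filter idea.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: toggle 'inside' for every edge crossed by a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # the edge spans the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Toy right triangle with vertices in (lon, lat) order, as in GeoJSON.
triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]

# Phase 1: regular grid over the bounding box (0.5-unit spacing).
grid = [(0.25 + 0.5 * i, 0.25 + 0.5 * j) for i in range(8) for j in range(8)]

# Phase 2: keep only the points that fall inside the polygon.
inside_points = [(x, y) for x, y in grid if point_in_polygon(x, y, triangle)]

print('points in external rectangle', len(grid))
print('points inside area', len(inside_points))
```

As in the notebook, the polygon is tested with longitude as x and latitude as y, because GeoJSON stores coordinates in that order.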


Next task: for each point we should find the nearest Yoga Studio, Coffee Shop, Grocery Store and Bakery.

For this task we will use the *scipy.spatial* module.

Because we are not interested in opening our Yoga Studio closer than 500 m to any existing Yoga Studio, we calculate this distance first and drop invalid points right away.

In [19]:
from scipy import spatial
pointsTree = spatial.KDTree(points)

yogaCoordinates = list(zip(dfYogaStudios['location.lat'], dfYogaStudios['location.lng']))
coffeeCoordinates = list(zip(dfCoffeeShops['location.lat'], dfCoffeeShops['location.lng']))
groceryCoordinates = list(zip(dfGroceryStores['location.lat'], dfGroceryStores['location.lng']))
bakeryCoordinates = list(zip(dfBakeries['location.lat'], dfBakeries['location.lng']))

yogaTree = spatial.KDTree(yogaCoordinates)
coffeeTree = spatial.KDTree(coffeeCoordinates)
groceryTree = spatial.KDTree(groceryCoordinates)
bakeryTree = spatial.KDTree(bakeryCoordinates)

def getNearest(point, coordinates, tree):
    nearestIndex = tree.query(point)[1]
    return (distance(point, coordinates[nearestIndex]).m, nearestIndex)

pointDistances = []
for p in tqdm(points):
    nearestYoga = getNearest(p, yogaCoordinates, yogaTree)
    if (nearestYoga[0] < yogaRadius): 
        continue
    
    nearestCoffee = getNearest(p, coffeeCoordinates, coffeeTree)
    nearestGrocery = getNearest(p, groceryCoordinates, groceryTree)
    nearestBakery = getNearest(p, bakeryCoordinates, bakeryTree)
    
    pointDistances.append((
        p[0],
        p[1],
        nearestYoga[0],
        dfYogaStudios.iloc[nearestYoga[1]]['id'],
        nearestCoffee[0],
        dfCoffeeShops.iloc[nearestCoffee[1]]['id'],
        nearestGrocery[0],
        dfGroceryStores.iloc[nearestGrocery[1]]['id'],
        nearestBakery[0],
        dfBakeries.iloc[nearestBakery[1]]['id']
    ))

dfPointDistance = pd.DataFrame(data=pointDistances,columns=['point.lat','point.lng',
                                                            'distance.yoga','id.yoga',
                                                            'distance.coffee','id.coffee',
                                                            'distance.grocery','id.grocery',
                                                            'distance.bakery', 'id.bakery',])
print('Valid points',dfPointDistance.shape)
print(dfPointDistance.head())


100%|██████████| 5755/5755 [00:11<00:00, 502.75it/s] 


Valid points (2889, 10)
   point.lat   point.lng  distance.yoga                   id.yoga  \
0  37.768862 -122.407766     577.442348  521d085c498e99760319302a   
1  37.768862 -122.407190     534.041065  521d085c498e99760319302a   
2  37.769316 -122.410641     974.964534  408c5100f964a520c8f21ee3   
3  37.769316 -122.410066     780.511181  521d085c498e99760319302a   
4  37.769316 -122.409491     735.128343  521d085c498e99760319302a   

   distance.coffee                 id.coffee  distance.grocery  \
0       646.433550  58a5d1b2739d851f18993c9b        231.012373   
1       612.456460  58a5d1b2739d851f18993c9b        237.441137   
2       830.717502  57ac23c1498eef0851316d5e        314.948071   
3       839.325603  57ac23c1498eef0851316d5e        300.277402   
4       857.214983  4a4b8a43f964a5206fac1fe3        233.236611   

                 id.grocery  distance.bakery                 id.bakery  
0  459ac90cf964a52088401fe3       856.120625  57ac23c1498eef0851316d5e  
1  459ac90cf964a52
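One caveat about the lookup above: the KD-tree compares raw latitude and longitude values, and at San Francisco's latitude one degree of longitude is noticeably shorter than one degree of latitude. The nearest candidate in degrees is therefore not always the nearest in meters. The sketch below shows such a case with made-up coordinates and a spherical haversine formula (an approximation, not the geodesic used by geopy); the helper names are hypothetical.

```python
import math


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) points on a spherical Earth."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(h))


def nearest_by_metres(point, candidates):
    """Brute-force nearest neighbour, returned as (distance_m, index)."""
    distances = [haversine_m(point, c) for c in candidates]
    index = min(range(len(candidates)), key=distances.__getitem__)
    return distances[index], index


point = (37.77, -122.41)
# Made-up candidates: one slightly north, one slightly further west (in degrees).
candidates = [(37.7745, -122.41), (37.77, -122.4155)]

# Nearest in plain degree space, which is what a KD-tree on (lat, lon) measures.
by_degrees = min(
    range(len(candidates)),
    key=lambda i: math.hypot(point[0] - candidates[i][0], point[1] - candidates[i][1]),
)
dist_m, by_metres = nearest_by_metres(point, candidates)

print('nearest in degrees:', by_degrees)
print('nearest in metres:', by_metres, round(dist_m))
```

For a few thousand grid points and a few dozen venues, a brute-force search in meters is cheap; scaling longitude by the cosine of the latitude before building the tree would be another option.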

Now we will define a scoring formula for each point. It is simple.

In short:
* the closer the nearest Coffee Shop / Grocery Store / Bakery, the better
* the farther the nearest Yoga Studio, the better
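Written out, the score computed by the cell below is the normalized distance to the nearest Yoga Studio divided by the mean normalized distance to the three venue types. The symbols are shorthand introduced here: $d_Y, d_C, d_G, d_B$ are the distances from the point to the nearest Yoga Studio, Coffee Shop, Grocery Store and Bakery, $\bar d_C, \bar d_G, \bar d_B$ are the mean next-venue distances calculated earlier, and $r = 500$ m is `yogaRadius`.

$$
\text{score} = \frac{d_Y / r}{\dfrac{1}{3}\left(\dfrac{d_C}{\bar d_C} + \dfrac{d_G}{\bar d_G} + \dfrac{d_B}{\bar d_B}\right)},
\qquad
\text{score}_{\text{scaled}} = \frac{\text{score} - \min(\text{score})}{\max(\text{score}) - \min(\text{score})}
$$

Every point kept so far has $d_Y / r \ge 1$, so a high score comes from venues that are closer than the typical next-venue distance, from a nearest studio that is far away, or from both.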

In [20]:
def scoreCalculation(row):
    dY = row['distance.yoga']
    dC = row['distance.coffee']
    dG = row['distance.grocery']
    dB = row['distance.bakery']
                    
    koeffYoga = dY / yogaRadius
    
    array = np.array([dC/meanCoffee, dG/meanGrocery, dB/meanBakery])

    return koeffYoga / array.mean()

dfPointDistance['score'] = dfPointDistance.progress_apply(func=scoreCalculation, axis=1)
dfPointDistance['score_scaled'] = (dfPointDistance['score'] - dfPointDistance['score'].min()) / (dfPointDistance['score'].max() - dfPointDistance['score'].min())
print(dfPointDistance.head())

100%|██████████| 2889/2889 [00:00<00:00, 9259.18it/s]


   point.lat   point.lng  distance.yoga                   id.yoga  \
0  37.768862 -122.407766     577.442348  521d085c498e99760319302a   
1  37.768862 -122.407190     534.041065  521d085c498e99760319302a   
2  37.769316 -122.410641     974.964534  408c5100f964a520c8f21ee3   
3  37.769316 -122.410066     780.511181  521d085c498e99760319302a   
4  37.769316 -122.409491     735.128343  521d085c498e99760319302a   

   distance.coffee                 id.coffee  distance.grocery  \
0       646.433550  58a5d1b2739d851f18993c9b        231.012373   
1       612.456460  58a5d1b2739d851f18993c9b        237.441137   
2       830.717502  57ac23c1498eef0851316d5e        314.948071   
3       839.325603  57ac23c1498eef0851316d5e        300.277402   
4       857.214983  4a4b8a43f964a5206fac1fe3        233.236611   

                 id.grocery  distance.bakery                 id.bakery  \
0  459ac90cf964a52088401fe3       856.120625  57ac23c1498eef0851316d5e   
1  459ac90cf964a52088401fe3       879.20

In [21]:
print(dfPointDistance['score'].describe())

count    2889.000000
mean        1.022121
std         0.527197
min         0.277024
25%         0.687990
50%         0.902947
75%         1.144142
max         5.027059
Name: score, dtype: float64


Let's select all points with a score of at least 2 as promising points.

In [22]:
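# .copy() gives an independent frame, so adding the 'cluster' column later does not trigger SettingWithCopyWarning
dfPointDistance = dfPointDistance[ dfPointDistance['score'] >= 2 ].copy()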
print(dfPointDistance.shape)
print(dfPointDistance['score'].describe())

(175, 12)
count    175.000000
mean       2.555680
std        0.574433
min        2.001891
25%        2.170750
50%        2.361780
75%        2.752324
max        5.027059
Name: score, dtype: float64


And show them on the map:

In [23]:
map_sf = folium.Map(location=[latPoints.mean(), lonPoints.mean()], zoom_start=13)

folium.GeoJson('./sf.geojson', name='Interested area').add_to(map_sf)

for lat, lng in zip(dfPointDistance['point.lat'], dfPointDistance['point.lng']):
    folium.CircleMarker(
        [lat, lng],
        radius=1,
        color='red'
    ).add_to(map_sf)

map_sf


Let's check once more that the selected points are more than 500 m away from the closest Yoga Studios.

In [24]:
for lat, lng, label in zip(dfYogaStudios['location.lat'], 
                                   dfYogaStudios['location.lng'], 
                                   dfYogaStudios.name):
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        popup=label,
        fill = True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(map_sf)

    folium.Circle(
        location=[lat, lng],
        radius=yogaRadius, 
        weight=1,
        fill=True,
        fill_color='blue',
        fill_opacity=0
    ).add_to(map_sf)

map_sf

## Clusterization <a class="anchor" id="h7"></a>
In this part we will group these points based on their distance to neighboring points.

For this task we will use the *DBSCAN* algorithm from the *sklearn.cluster* module.

In [25]:
from sklearn.cluster import DBSCAN 
from sklearn.preprocessing import StandardScaler 

dfClustering = dfPointDistance[['point.lat','point.lng']]
# dfClustering = dfPointDistance[['score']]
dfClustering = StandardScaler().fit_transform(dfClustering)

eps = 0.1
samples = 3

db = DBSCAN(eps=eps, min_samples=samples).fit(dfClustering)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
dfPointDistance['cluster']=labels

realClusterNum=len(set(labels)) - (1 if -1 in labels else 0)
clusterNum = len(set(labels)) 

print('Cluster count=',len(dfPointDistance['cluster'].unique()))
print(dfPointDistance.groupby('cluster').agg({'cluster':['count']}))

Cluster count= 5
        cluster
          count
cluster        
0            33
1            23
2            38
3            61
4            20
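To make the roles of `eps` and `min_samples` concrete, here is a minimal sketch on made-up 2-D points (no scaling step, unlike the notebook, where `eps` is expressed in standardized units). It also shows the noise label `-1`, which did not occur in the run above but would be counted as an extra "cluster" by a plain `unique()` count.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy 2-D points: two tight groups and one isolated point (made-up values).
X = np.array([
    [0.00, 0.00], [0.00, 0.05], [0.05, 0.00],
    [5.00, 5.00], [5.00, 5.05], [5.05, 5.00],
    [10.0, 0.00],
])

# A point is a core point if at least min_samples points (itself included) lie within eps.
db = DBSCAN(eps=0.1, min_samples=3).fit(X)
labels = db.labels_

# Noise points get the label -1 and should not be counted as a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())

print(labels, n_clusters, n_noise)
```

The isolated point has no neighbors within `eps`, so it is marked as noise instead of forming a cluster of its own.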


Let's show the resulting clusters on the map.

In [26]:
import matplotlib.pyplot as plt
import matplotlib

colors = [
    'red',
    'blue',
    'green',
    'purple',
    'orange',
    'cadetblue',
    'darkred',
    'lightred',
    'beige',
    'darkblue',
    'darkgreen',
    'lightgreen',
    'lightblue',
    'darkpurple',
    'pink',
    'black'
]

map_sf = folium.Map(location=[latPoints.mean(), lonPoints.mean()], zoom_start=14)

for clust_number in set(labels):
    # np.int was removed in NumPy 1.24; the built-in int works in every version
    c=(([0.4,0.4,0.4]) if clust_number == -1 else colors[int(clust_number)])
        
    clust_set = dfPointDistance[dfPointDistance.cluster == clust_number]    
    for lat, lng in zip(clust_set['point.lat'], clust_set['point.lng']):
        folium.CircleMarker(
            [lat, lng],
            radius=2,
            color=c
        ).add_to(map_sf)

map_sf

Let's select the point with the maximum score from each cluster

In [27]:
bestPoints = []
for cluster in dfPointDistance['cluster'].unique():
    dfCluster = dfPointDistance[ dfPointDistance['cluster'] == cluster]
    bestPoints.append( dfCluster[ dfCluster['score'] == dfCluster['score'].max() ].values[0] )
    
dfBestPoints = pd.DataFrame(data=bestPoints, columns=dfPointDistance.columns)
dfBestPoints.sort_values(by='score', ascending=False, inplace=True)
dfBestPoints

Unnamed: 0,point.lat,point.lng,distance.yoga,id.yoga,distance.coffee,id.coffee,distance.grocery,id.grocery,distance.bakery,id.bakery,score,score_scaled,cluster
3,37.784306,-122.40719,574.163652,599f66f0e679bc4d9cc802ef,40.054584,57a0e506498e6c087e114f97,205.289167,58a3a06dbbec6606e1b580a3,60.830437,5aecd497628c83002c67de8e,5.027059,1.0,3
1,37.776584,-122.393962,509.519036,57f8ada7498e2bb7c8392f1a,93.520882,4a844f01f964a5203bfc1fe3,37.133358,49d031f7f964a5200c5b1fe3,65.351607,49c1c6ebf964a520cb551fe3,4.278811,0.842475,1
0,37.777038,-122.410641,783.132733,5ba43ed51f7440002cba3098,22.852341,55554f0a498e7ffc6883325c,720.883583,459ac90cf964a52088401fe3,22.852341,55554f0a498e7ffc6883325c,3.648789,0.70984,0
4,37.797025,-122.405465,678.11013,4e1e18f37d8bb2842103efd3,22.655181,50ad5b86067d33c34bef6724,595.566121,4a5e8596f964a520b0be1fe3,129.490375,4a2ac0fdf964a52049961fe3,2.948709,0.562456,4
2,37.781126,-122.392812,889.059076,54498315498ebfc6bf328b60,152.853259,582352764972292ad70ec0bd,500.733092,49d031f7f964a5200c5b1fe3,191.201359,49e756dbf964a5208d641fe3,2.576122,0.484017,2


... and attach information about the closest Yoga Studio, Coffee Shop, Grocery Store and Bakery

In [28]:
def f(x, df, colName):
    name = df[df['id'] == x][colName].iloc[0]
    return name

def attachNearestInfo(idColName, df, nameColName, addressColName):
    dfBestPoints[nameColName] = dfBestPoints[idColName].apply(lambda x: f(x, df,'name') )
    dfBestPoints[addressColName] = dfBestPoints[idColName].apply(lambda x: f(x, df,'location.address') )

attachNearestInfo('id.yoga', dfYogaStudios, 'yoga.name', 'yoga.address')
attachNearestInfo('id.coffee', dfCoffeeShops, 'coffee.name', 'coffee.address')
attachNearestInfo('id.grocery', dfGroceryStores, 'grocery.name', 'grocery.address')
attachNearestInfo('id.bakery', dfBakeries, 'bakery.name', 'bakery.address')

print(dfBestPoints)

   point.lat   point.lng  distance.yoga                   id.yoga  \
3  37.784306 -122.407190     574.163652  599f66f0e679bc4d9cc802ef   
1  37.776584 -122.393962     509.519036  57f8ada7498e2bb7c8392f1a   
0  37.777038 -122.410641     783.132733  5ba43ed51f7440002cba3098   
4  37.797025 -122.405465     678.110130  4e1e18f37d8bb2842103efd3   
2  37.781126 -122.392812     889.059076  54498315498ebfc6bf328b60   

   distance.coffee                 id.coffee  distance.grocery  \
3        40.054584  57a0e506498e6c087e114f97        205.289167   
1        93.520882  4a844f01f964a5203bfc1fe3         37.133358   
0        22.852341  55554f0a498e7ffc6883325c        720.883583   
4        22.655181  50ad5b86067d33c34bef6724        595.566121   
2       152.853259  582352764972292ad70ec0bd        500.733092   

                 id.grocery  distance.bakery                 id.bakery  ...  \
3  58a3a06dbbec6606e1b580a3        60.830437  5aecd497628c83002c67de8e  ...   
1  49d031f7f964a5200c5b1fe3   
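
The lookup in `f` filters the whole venue frame once per point and per column. A minimal sketch of the same attachment done with a single `DataFrame.merge`, assuming the venue frames carry `id`, `name` and `location.address` columns as they do above. The helper `attach_nearest_info` and the sample data are hypothetical and only illustrate the idea:

```python
import pandas as pd

def attach_nearest_info(points, venues, id_col, name_col, address_col):
    """Return a copy of points with the name/address of the venue referenced by id_col."""
    lookup = (
        venues[['id', 'name', 'location.address']]
        .drop_duplicates(subset='id')  # one row per venue, so the join cannot add rows
        .rename(columns={'id': id_col, 'name': name_col,
                         'location.address': address_col})
    )
    # A left join keeps every point and preserves the row order of `points`
    merged = points.merge(lookup, on=id_col, how='left')
    merged.index = points.index  # merge resets the index, so restore it
    return merged

# Made-up sample data, for illustration only
points = pd.DataFrame({'id.yoga': ['b', 'a']}, index=[3, 1])
venues = pd.DataFrame({
    'id': ['a', 'b'],
    'name': ['Studio A', 'Studio B'],
    'location.address': ['1 Example St', '2 Example Ave'],
})
result = attach_nearest_info(points, venues, 'id.yoga', 'yoga.name', 'yoga.address')
print(result)
```

Unlike `attachNearestInfo`, this version returns a new frame instead of modifying `dfBestPoints` in place, and a venue id with no match gives `NaN` rather than an `IndexError`.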

Now we will use Nominatim once more to get the address of each best point:

In [29]:
def coordsToAddress(row):
    # Geocode the 'lat, lng' string and keep the first element of the result (the address)
    coords = '{}, {}'.format(row['point.lat'], row['point.lng'])
    result = geocode(coords)
    return result[0]

dfBestPoints['address'] = dfBestPoints.progress_apply(func=coordsToAddress, axis=1)

dfBestPoints.head()

100%|██████████| 5/5 [00:04<00:00,  1.02it/s]


Unnamed: 0,point.lat,point.lng,distance.yoga,id.yoga,distance.coffee,id.coffee,distance.grocery,id.grocery,distance.bakery,id.bakery,...,cluster,yoga.name,yoga.address,coffee.name,coffee.address,grocery.name,grocery.address,bakery.name,bakery.address,address
3,37.784306,-122.40719,574.163652,599f66f0e679bc4d9cc802ef,40.054584,57a0e506498e6c087e114f97,205.289167,58a3a06dbbec6606e1b580a3,60.830437,5aecd497628c83002c67de8e,...,3,Ritual Hot Yoga - Fidi,"USPS, Kearny Street, Union Square, SF, Califor...",Starbucks,"865 Market St, C 26A",Trader Joe's,10 4th Street,BAKE Cheese Tart,845 Market St,"Westfield San Francisco Centre, 845, Market St..."
1,37.776584,-122.393962,509.519036,57f8ada7498e2bb7c8392f1a,93.520882,4a844f01f964a5203bfc1fe3,37.133358,49d031f7f964a5200c5b1fe3,65.351607,49c1c6ebf964a520cb551fe3,...,1,CorePower Yoga,"1200 4th Street, Suite D & E",Philz Coffee,201 Berry St,Safeway,298 King St,Panera Bread,301 King St,"The Beacon, 260, King Street, South Beach, SF,..."
0,37.777038,-122.410641,783.132733,5ba43ed51f7440002cba3098,22.852341,55554f0a498e7ffc6883325c,720.883583,459ac90cf964a52088401fe3,22.852341,55554f0a498e7ffc6883325c,...,0,CorePower Yoga,150 Van Ness Ave Ste A,Vive La Tarte,1160 Howard St,Trader Joe's,555 9th St,Vive La Tarte,1160 Howard St,"Cellarmaker Brewing Co., 1150, Howard Street, ..."
4,37.797025,-122.405465,678.11013,4e1e18f37d8bb2842103efd3,22.655181,50ad5b86067d33c34bef6724,595.566121,4a5e8596f964a520b0be1fe3,129.490375,4a2ac0fdf964a52049961fe3,...,4,Yoga Studio at Equinox,301 Pine St,Réveille Coffee Co.,200 Columbus Ave,Safeway,145 Jackson St,Golden Gate Bakery,1029 Grant Ave,"Lifemark Group, Columbus Avenue, Chinatown, SF..."
2,37.781126,-122.392812,889.059076,54498315498ebfc6bf328b60,152.853259,582352764972292ad70ec0bd,500.733092,49d031f7f964a5200c5b1fe3,191.201359,49e756dbf964a5208d641fe3,...,2,CorePower Yoga,215 Fremont Street,Blue Bottle Coffee,2 South Park,Safeway,298 King St,The Bagel Bakery,151 Townsend St,"Dropbox, 333, Brannan Street, West SoMa, SF, C..."
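
Since every lookup is a network request to a rate-limited service, it can help to make sure the same point is never geocoded twice. A minimal sketch, assuming `geocode` is any callable that takes a `'lat, lng'` string and returns an object whose first element is the address (as `result[0]` is used above). `make_cached_reverse` and `fake_geocode` are hypothetical names, and the fake stands in for the real network call:

```python
def make_cached_reverse(geocode):
    """Wrap a geocoding callable so that each coordinate pair is looked up only once."""
    cache = {}

    def coords_to_address(lat, lng):
        # Round to 6 decimals so that near-identical floats share one cache entry
        key = (round(lat, 6), round(lng, 6))
        if key not in cache:
            result = geocode('{}, {}'.format(*key))
            cache[key] = result[0] if result is not None else None
        return cache[key]

    return coords_to_address

calls = []

def fake_geocode(query):
    # Stand-in for the real geocoder: records the query and returns a tuple
    calls.append(query)
    return ('address for ' + query,)

coords_to_address = make_cached_reverse(fake_geocode)
first = coords_to_address(37.784306, -122.40719)
second = coords_to_address(37.784306, -122.40719)
print(first)
print(len(calls))
```

With the real `geocode` in place of `fake_geocode`, the wrapper could be used as `dfBestPoints.apply(lambda row: coords_to_address(row['point.lat'], row['point.lng']), axis=1)`.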


## Conclusion <a class="anchor" id="h8"></a>
We can show the best points on the map:

In [32]:
from folium.plugins import BeautifyIcon

map_sf = folium.Map(location=[latPoints.mean(), lonPoints.mean()], zoom_start=14)
folium.GeoJson('./sf.geojson', name='Interested area').add_to(map_sf)

for clust_number in set(labels):
    # np.int was removed in NumPy 1.24; the built-in int does the same job
    c = [0.4, 0.4, 0.4] if clust_number == -1 else colors[int(clust_number)]
        
    clust_set = dfPointDistance[dfPointDistance.cluster == clust_number]    
    for lat, lng in zip(clust_set['point.lat'], clust_set['point.lng']):
        folium.CircleMarker(
            [lat, lng],
            radius=2,
            color=c
        ).add_to(map_sf)

for lat, lng, clust_number, address in zip(dfBestPoints['point.lat'], dfBestPoints['point.lng'], 
                                           dfBestPoints['cluster'], dfBestPoints['address']):
    # np.int was removed in NumPy 1.24; the built-in int does the same job
    c = [0.4, 0.4, 0.4] if clust_number == -1 else colors[int(clust_number)]
            
    folium.Marker(
            [lat, lng],
            icon=folium.Icon(color=c),
            popup = folium.Popup('{}'.format(address), max_width=300)
        ).add_to(map_sf)
    
map_sf
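
Leaflet, which folium renders to, expects colours as CSS strings such as `#666666`. A minimal sketch of converting an RGB triple of floats in [0, 1] into that form, assuming `colors` holds such triples (its definition is not shown in this section, so this is a guess based on the `[0.4,0.4,0.4]` fallback). `rgb_to_hex` is a hypothetical helper:

```python
def rgb_to_hex(rgb):
    """Convert an (r, g, b[, a]) sequence of floats in [0, 1] to '#rrggbb'."""
    # Clamp each channel to [0, 1], scale to 0..255 and drop any alpha channel
    r, g, b = (int(round(max(0.0, min(1.0, float(v))) * 255)) for v in rgb[:3])
    return '#{:02x}{:02x}{:02x}'.format(r, g, b)

noise_color = rgb_to_hex([0.4, 0.4, 0.4])
print(noise_color)
```

The result could then be passed as `color=rgb_to_hex(c)` to `folium.CircleMarker`.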

Or print them in text form:

In [31]:
print('Total clusters found: {}'.format(dfBestPoints.shape[0]))
print('')

for index, row in dfBestPoints.iterrows():
    print('POI address: {}'.format(row['address']))
    print('Nearest Yoga Studio \'{}\' is {:.0f} metres away at {}'.format(row['yoga.name'],row['distance.yoga'],row['yoga.address']))
    print('Nearest Coffee Shop \'{}\' is {:.0f} metres away at {}'.format(row['coffee.name'],row['distance.coffee'],row['coffee.address']))
    print('Nearest Grocery Store \'{}\' is {:.0f} metres away at {}'.format(row['grocery.name'],row['distance.grocery'],row['grocery.address']))
    print('Nearest Bakery \'{}\' is {:.0f} metres away at {}'.format(row['bakery.name'],row['distance.bakery'],row['bakery.address']))
    print('')

Total clusters found: 5

POI address: Westfield San Francisco Centre, 845, Market Street, Union Square, SF, California, 94103, USA
Nearest Yoga Studio 'Ritual Hot Yoga - Fidi' is 574 metres away at USPS, Kearny Street, Union Square, SF, California, 94103-3124, USA
Nearest Coffee Shop 'Starbucks' is 40 metres away at 865 Market St, C 26A
Nearest Grocery Store 'Trader Joe's' is 205 metres away at 10 4th Street
Nearest Bakery 'BAKE Cheese Tart' is 61 metres away at 845 Market St

POI address: The Beacon, 260, King Street, South Beach, SF, California, 94107, USA
Nearest Yoga Studio 'CorePower Yoga' is 510 metres away at 1200 4th Street, Suite D & E
Nearest Coffee Shop 'Philz Coffee' is 94 metres away at 201 Berry St
Nearest Grocery Store 'Safeway' is 37 metres away at 298 King St
Nearest Bakery 'Panera Bread' is 65 metres away at 301 King St

POI address: Cellarmaker Brewing Co., 1150, Howard Street, West SoMa, SF, California, 94103, USA
Nearest Yoga Studio 'CorePower Yoga' is 783 metres 

Thanks for reading!