# Report for the Applied Data Science Capstone project  
by Andreas Johannes

<a id='top'></a>  

# Salvation of a Rosesalesman

[1 Background](#background)  
[2 Data sources and treatment](#data)  
[3 Methodology](#methods)  
 - [3.1 Paris map](#map)  
 - [3.2 Foursquare data](#foursquare)  
 - [3.3 Heat map](#heat)  
 - [3.4 k-means](#kmeans)
 
[4 Results](#results)  
[5 Discussion](#discussion)  
[6 Conclusion](#conclusion)  

<a id='background'></a>  
## 1 Background  
[back to top](#top)  

![Be this guy!](https://thumbs.dreamstime.com/z/money-7661988.jpg)

### Sell your Roses  here (or rather there)!

Parisean Rose seller, this could be you!
Whether you are selling Roses to couples or playing your Fiddle for tips, you want to know where the most restaurants and bars are, because that's were the most money can be made. Read on for an depth analysis of where to go tonight to ply your trade.

PLUS if you know you made money in one area, use our similarity rating to find similar areas for your next nights work!

![make the machiene work for you](https://thumbs.dreamstime.com/z/human-hand-receiving-rose-artificial-hand-senior-87445418.jpg)



### Summary:  
To find the best areas to sell Roses on the street:
 - Grade areas in Paris according to how many restaurants and bars there are in them
  - Show this data on a map of Paris
  - By restaurant category/type
 - Find locations which offer similar night life options
  - generally categorize areas
  - given a starting address, find similar areas  
Probably this analysis may be useful outside of the rose-selling market, but that's a future venture.

<a id='data'></a>  
## 2 Data sources and treatment  
[back to top](#top)  
### The heat-map   
 - We will segment Paris into evenly sized tiles
 - Use **Foursquare** to obtain a count for the restaurants and bars in each tile.
 - Categorize Restaurants and bars in 4-8 categories (eg. Bar, club, Fast Food etc.)
 - Use **folium** to plot heat-map tiles onto a map of Paris for each category
 - Sort by number of found places to suggest best areas.

### Similar areas  
Use above categories to find areas that are similar:  
 - Inspect distribution to see how many area categories are sensible
 - use k-means to group this number of categories
 - Plot to map of Paris
 - given a location, use generalized distance across features (as used in k-means algorithm) to produce a sorted list of areas similar to the current location.


 

<a id='methods'></a>  
## 3 Methodology
[back to top](#top)

In this section we will execute the strategy outlined in the previous section.

<a id='map'></a>  
### 3.1 Paris map  
[back to top](#top)

In [21]:
import numpy as np
import pandas as pd
import folium


Create a regular hexagonal grid around Paris. We wil use cube coordinates centered around the center of paris accordintg to [wiki: paris](https://en.wikipedia.org/wiki/Paris). The tiles will be spaced 200 m appart and we will have 50 tiles in each direction. This covers the center of Paris quite well and should have sufficient resolution.
See [https://www.redblobgames.com/grids/hexagons/] for an introduction to hexagonal coordinates.

In [307]:
# get a 3D grid from with 2*tile_count + 1 number of tiles across
tile_count = tc = 2
p_range, q_range, r_range = range(-tc,tc+1),range(-tc,tc+1),range(-tc,tc+1)
r_i, q_i, p_i = np.meshgrid(p_range, q_range, r_range)
pqr_i = np.stack([p_i.flat, q_i.flat, r_i.flat])
# reduce grid to include only the indexes on our hexagonal plane
hex_mask = pqr_i.sum(axis=0)==0
hex_mask
#xyz_hex = np.empty(shape=(3,hex_mask.sum()),dtype=np.int32)
pqr_hex = pqr_i[:,hex_mask]
pqr_hex.dtype, pqr_hex.T.shape

(dtype('int32'), (19, 3))

We have an index grid, not to convert it into geospacial coordinates. We want the spacing to be tile_size, and first need to convert that to angular distances. We will only cover a small segment of the sperical earth and use the apropriate simplifiations.
see [wiki: geographic coordinates](https://en.wikipedia.org/wiki/Geographic_coordinate_system)

In [308]:
tile_size = ts = 200. # m
earth_radius =  6367449
center_of_paris = (48.8567, 2.3508)
# in angle per meter
lat_conversion = 180./(np.pi*earth_radius)
lon_conversion = 180/(np.pi*earth_radius)*np.cos(np.pi/180.0*center_of_paris[0])
lat_conversion, lon_conversion

# defining vectors to get form the center of the hex to corner points in angles
h = 0.6*ts*lon_conversion
v = 0.6*ts*lat_conversion
s60 = np.sin(60./180.*np.pi)
c60 = np.cos(60./180.*np.pi)
x_step = (v, 0)
y_step = (-v*c60, h*s60)
z_step = (-v*c60, -h*s60)
step_vector = np.asarray((x_step, y_step, z_step)).T

def get_corners(step_vector, center):
    '''
    returns the list of coordinates for the corners of a hexagon defined by
    the hexagonal step vector and a center point
    ''' 
    coordinates = []
    perms = [[1,0,0],
             [0,0,-1],
             [0,1,0],
             [-1,0,0],
             [0,0,1],
             [0,-1,0]]
             
    for perm in perms:
        coordinates.append(list(center + np.dot(step_vector,perm)))
    return coordinates

We have all we need to create the hexagonal grid mapped over Paris.

In [309]:
# usefull library to create geojson files
# https://github.com/karimbahgat/PyGeoj
import pygeoj
# creating regular tiles around city center
json_tiles = pygeoj.new()
json_tiles_fname = "tiles.geojson"
coords_str_list = []
center_list = []
p_list = []
q_list = []
r_list = []
for coords in pqr_hex.T:
    # create a geojson file
    coords_str=('_').join([str(x) for x in coords])
    coords_str_list.append(coords_str)
    p_list.append(coords[0])
    q_list.append(coords[1])
    r_list.append(coords[2])
    
    center = center_of_paris[::-1] + np.dot(step_vector, coords)
    center_list.append(center)
    coordinates = get_corners(step_vector, center)
    json_tiles.add_feature(
        properties={"coords_str":"coords_str"},
        geometry={"type":"Polygon", "coordinates":[coordinates]})

json_tiles.add_all_bboxes()
json_tiles.update_bbox()
json_tiles.add_unique_id()
json_tiles.save(json_tiles_fname)
coordinates

[[2.348640424464626, 48.85793051899548],
 [2.3481005305807825, 48.85854577849322],
 [2.347020742813096, 48.85854577849322],
 [2.3464808489292523, 48.85793051899548],
 [2.347020742813096, 48.85731525949774],
 [2.3481005305807825, 48.85731525949774]]

In [310]:
# create a corresponding dataframe:
center_array = np.asarray(center_list)
df_tiles = pd.DataFrame({'coords_str':coords_str_list, 
                         'lat':center_array[:,0],
                         'lon':center_array[:,1]})
latdist_array = (np.asarray(df_tiles.lat)-center_of_paris[1])/lat_conversion
londist_array = (np.asarray(df_tiles.lon)-center_of_paris[0])/lon_conversion
df_tiles['distance_to_center'] = np.asarray(np.sqrt(latdist_array**2 + londist_array**2),
                                            dtype=np.int32)
df_tiles['p'] = p_list
df_tiles['q'] = q_list
df_tiles['r'] = r_list

In [311]:
map_paris = folium.Map(location=center_of_paris, zoom_start=12)
test_df = pd.DataFrame({'Paris':1}, columns=['City','Value'])
# Add the color for the chloropleth:
folium.Choropleth(
    geo_data=json_tiles_fname,
    name='choropleth',
    data=df_tiles,
    fill_color='Blues',
    columns=['coords_str', 'distance_to_center'],
    key_on='feature.properties.coords_str',
    fill_opacity=0.5, 
    line_opacity=0.1,
    legend_name='Distance to Center',   
).add_to(map_paris)


map_paris

In [312]:
df_tiles

Unnamed: 0,coords_str,lat,lon,distance_to_center,p,q,r
0,2_-2_0,2.354039,48.855469,415,2,-2,0
1,1_-2_1,2.35242,48.854854,360,1,-2,1
2,0_-2_2,2.3508,48.854239,415,0,-2,2
3,2_-1_-1,2.354039,48.8567,360,2,-1,-1
4,1_-1_0,2.35242,48.856085,207,1,-1,0
5,0_-1_1,2.3508,48.855469,207,0,-1,1
6,-1_-1_2,2.34918,48.854854,360,-1,-1,2
7,2_0_-2,2.354039,48.857931,415,2,0,-2
8,1_0_-1,2.35242,48.857315,207,1,0,-1
9,0_0_0,2.3508,48.8567,0,0,0,0


<a id='foursquare'></a>  
### 3.2 Foursquare data  
[back to top](#top)

Noe that we have the grid on which we want to check for locations, lets use foursquare to find them. We will immedeately collect different restaurant types seperately for later.
[see foursquare:categories](https://developer.foursquare.com/docs/resources/categories) 

In [313]:
import requests

# not sharing foursquare credentials
with open('../../foursquare_credentials.dat','r') as f:
    client_id, client_secret = f.readlines()
client_id = client_id[:-1]
version = '20180724'

Manual selection of some categories:

In [314]:
food_category = '4d4b7105d754a06374d81259' # 'Root' category for all food-related venues
nightlife_category = '4d4b7105d754a06376d81259'# 'Root' category for all nightlife venues
# other categories:
categories_dict = {'other':['503288ae91d4c4b30a586d67',
                                 '4bf58dd8d48988d1c8941735',
                                 '4bf58dd8d48988d14e941735',
                                 '4bf58dd8d48988d169941735',
                                 '52e81612bcbc57f1066b7a01',
                                 '4bf58dd8d48988d1df931735',
                                 '52e81612bcbc57f1066b79f4',
                                 '4bf58dd8d48988d17a941735',
                                 '4bf58dd8d48988d144941735',
                                 '4bf58dd8d48988d108941735',
                                 '4bf58dd8d48988d120951735',
                                 '4bf58dd8d48988d1be941735',
                                 '4bf58dd8d48988d1c1941735',
                                 '56aa371be4b08b9a8d573508',
                                 '4bf58dd8d48988d1c4941735',
                                 '4bf58dd8d48988d1ce941735',
                                 '4bf58dd8d48988d1cc941735',
                                 '4bf58dd8d48988d1dc931735',
                                 '56aa371be4b08b9a8d573538'],
                        'sweet':['4bf58dd8d48988d146941735',
                                 '52e81612bcbc57f1066b79f2',
                                 '4bf58dd8d48988d1d0941735',
                                 '4bf58dd8d48988d148941735'],
                        'european':['52f2ae52bcbc57f1066b8b81',
                                    '5293a7d53cf9994f4e043a45',
                                    '4bf58dd8d48988d147941735',
                                    '5744ccdfe4b0c0459246b4d0',
                                    '4bf58dd8d48988d109941735',
                                    '52e81612bcbc57f1066b7a05',
                                    '52e81612bcbc57f1066b7a09',
                                    '4bf58dd8d48988d10c941735',
                                    '52e81612bcbc57f1066b79fa',
                                    '4bf58dd8d48988d110941735',
                                    '52e81612bcbc57f1066b79fd',
                                    '4bf58dd8d48988d1c0941735',
                                    '52e81612bcbc57f1066b79f9',
                                    '4bf58dd8d48988d1c2941735',
                                    '52e81612bcbc57f1066b7a04',
                                    '4def73e84765ae376e57713a',
                                    '5293a7563cf9994f4e043a44',
                                    '4bf58dd8d48988d1c6941735',
                                    '5744ccdde4b0c0459246b4a3',
                                    '56aa371be4b08b9a8d57355a',
                                    '4bf58dd8d48988d150941735',
                                    '4bf58dd8d48988d158941735',
                                    '4f04af1f2fb6e1c99f3db0bb',
                                    '52e928d0bcbc57f1066b7e96'],
                        'asian':['4bf58dd8d48988d142941735',
                                 '4bf58dd8d48988d10f941735',
                                 '4bf58dd8d48988d115941735',
                                 '52e81612bcbc57f1066b79f8',
                                 '5413605de4b0ae91d18581a9'],
                        'fast':['4bf58dd8d48988d179941735',
                                '4bf58dd8d48988d16a941735',
                                '52e81612bcbc57f1066b7a02',
                                '52e81612bcbc57f1066b79f1',
                                '4bf58dd8d48988d143941735',
                                '52e81612bcbc57f1066b7a0c',
                                '4bf58dd8d48988d16c941735',
                                '4bf58dd8d48988d128941735',
                                '4bf58dd8d48988d16d941735',
                                '4bf58dd8d48988d1e0931735',
                                '52e81612bcbc57f1066b7a00',
                                '4bf58dd8d48988d10b941735',
                                '4bf58dd8d48988d16e941735',
                                '4edd64a0c7ddd24ca188df1a',
                                '56aa371be4b08b9a8d57350b',
                                '4bf58dd8d48988d1cb941735',
                                '4d4ae6fc7a7b7dea34424761',
                                '5283c7b4e4b094cb91ec88d7',
                                '4bf58dd8d48988d1ca941735',
                                '4bf58dd8d48988d1c5941735',
                                '4bf58dd8d48988d1bd941735',
                                '4bf58dd8d48988d1c7941735',
                                '4bf58dd8d48988d1dd931735'],
                   'night_life':['52e81612bcbc57f1066b7a06',
                                 nightlife_category]}
# fix keys order:
key_list = list(categories_dict.keys())
key_list.sort()
key_list

['asian', 'european', 'fast', 'night_life', 'other', 'sweet']

Unfortunately, some of these are parent categories so we need to delve
a little deeper intho the foursquare category business, by example the nightlife category:

In [315]:
get_categories_url = 'https://api.foursquare.com/v2/venues/categories?&client_id={}&client_secret={}&v={}'.format(
        client_id, client_secret, version)
all_foursquare_categories = requests.get(get_categories_url).json()['response']['categories']

In [316]:
def get_category_by_id(parent, category_id, result=None):
    if result == None:
        if type(parent) == list:    
            for parent_category in parent:
                result = get_category_by_id(parent_category, category_id, result)
        elif type(parent) == dict:
            if parent['id'] == category_id:
                return parent
            elif len(parent['categories'])!=0:
                for item in parent['categories']:
                    result = get_category_by_id(item, category_id, result)
            else:
                result = None
        return result
    else:
        return result
    
nightlife_categories = get_category_by_id(all_foursquare_categories, nightlife_category)

def get_descendant_categories(parent, categories=[], verbose=False):
    if type(parent) == list:    
        for parent_category in parent:
            categories = get_descendant_categories(parent_category, categories, verbose)
        return categories
    
    elif type(parent) == dict:
        if verbose:
            print(parent['name'], len(categories)+1)
        categories.append(parent['id'])
        if len(parent['categories'])==0:
            return categories
        else:
            for item in parent['categories']:
                categories = get_descendant_categories(item, categories, verbose)
        return categories
    
nl = get_descendant_categories(nightlife_categories, [], verbose=True)
len(nl)

Nightlife Spot 1
Bar 2
Beach Bar 3
Beer Bar 4
Beer Garden 5
Champagne Bar 6
Cocktail Bar 7
Dive Bar 8
Gay Bar 9
Hookah Bar 10
Hotel Bar 11
Karaoke Bar 12
Pub 13
Sake Bar 14
Speakeasy 15
Sports Bar 16
Tiki Bar 17
Whisky Bar 18
Wine Bar 19
Brewery 20
Lounge 21
Night Market 22
Nightclub 23
Other Nightlife 24
Strip Club 25


25

Now we can iterate over the above manually created categories dict to get a a comprehensive set of all related categories.

In [317]:
catsets_dict = {}
for key, categories in categories_dict.items():
    key_list = []
    for cat_id in categories:
        parent_cat = get_category_by_id(all_foursquare_categories, cat_id)
        key_list += get_descendant_categories(parent=parent_cat, categories=key_list, verbose=False)
    catsets_dict.update({key:set(key_list)})
for key, val in catsets_dict.items():
    print("category {} has {} id's".format(key, len(val)))

category other has 49 id's
category sweet has 9 id's
category european has 96 id's
category asian has 129 id's
category fast has 23 id's
category night_life has 26 id's


We define the functions that will GET the Foursquare data for each area around Paris and filter the categories.

In [318]:
def get_categories(categories):
    return [cat['id'] for cat in categories]

def count_categories(catsets_dict, found_categories):
    result = dict([[x,0] for x in catsets_dict.keys()])
    for key, categories_list in catsets_dict.items():
        for found_id_list in found_categories:
            for found_id in found_id_list:
                if found_id in categories_list:
                    result[key] += 1
    return result
    
def get_venues_near_location(lat, lon, client_id, client_secret, radius=250, limit=100):

    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        client_id, client_secret, version, lat, lon, radius, limit)
    try:
        item_list = requests.get(url).json()['response']['groups'][0]['items']
        venue_categories = [get_categories(item['venue']['categories']) for item in item_list]
    except:
        venue_categories = []
        
    return venue_categories

In [321]:
counts_array = np.zeros(shape=(center_array.shape[1], len(key_list)), dtype=np.int32)
for i, coord in enumerate(center_array):
    foursquare_result = get_venues_near_location(coord[0], coord[1], client_id, client_secret, radius=250, limit=100)
    if len(foursquare_result)==0:
        counts
    counts = count_categories(catsets_dict=catsets_dict, found_categories=categories)
    for j, key in enumerate(key_list):
        counts_array[i,j] = counts[key]

KeyError: '52e81612bcbc57f1066b7a06'

In [323]:
foursquare_result

[]

[['4bf58dd8d48988d164941735'],
 ['4bf58dd8d48988d163941735'],
 ['4bf58dd8d48988d163941735'],
 ['57558b36e4b065ecebd306b6'],
 ['52f2ab2ebcbc57f1066b8b1b'],
 ['4bf58dd8d48988d10c951735'],
 ['4c2cd86ed066bed06c3c5209'],
 ['52f2ab2ebcbc57f1066b8b23'],
 ['4bf58dd8d48988d10c941735'],
 ['4bf58dd8d48988d103951735'],
 ['4bf58dd8d48988d108951735'],
 ['4bf58dd8d48988d145941735']]

In [236]:
a = [1,2,3,1]
[1] in a

False

In [148]:
len(nightlife_categories)

7

<a id='heat'></a>  
### 3.3 Heat map  
[back to top](#top)

<a id='kmeans'></a>  
### 3.4 k-means  
[back to top](#top)

<a id='results'></a>  
## 4 Results  
[back to top](#top)

<a id='discussion'></a>  
## 5 Discussion  
[back to top](#top)

<a id='conclusion'></a>  
## 6 Conclusion  
[back to top](#top)