# Capstone Project - The Battle of Neighbourhoods

## Exploring where to open a bike shop in Copenhagen
##### By Michele Deluchi (michele.deluchi@gmail.com)

## Table of contents
* [Introduction: Background](#introduction)
* [Introduction: Business Problem](#introduction)
* [Data extraction and wrangling](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Limitations](#limitations)
* [Conclusion](#conclusion)

## Introduction: Background
###### Data obtained through publicly accessible statistics provided by the Danish Cycling Embassy, Copenhagen Kommune, Frederiksberg Kommune, Statistics Denmark and various other official sources.

#### Copenhagen's cycling culture
Cycling in Copenhagen is, as in most of Denmark, an important means of transportation and a dominant feature of the cityscape, often noticed by visitors. The city offers favorable cycling conditions (dense urban neighbourhoods, short distances and flat terrain) along with an extensive and well-designed system of cycle tracks. This has earned it a reputation as one of the most bicycle-friendly cities in the world, possibly the most. Every day 1.2 million kilometers (0.75 million miles) are cycled in Copenhagen, with 62% of all citizens commuting to work, school or university by bicycle; in fact, almost as many people commute by bicycle in greater Copenhagen as cycle to work in the entire United States. Cycling is generally perceived as a healthier, more environmentally friendly, cheaper, and often quicker way to get around town than public transport or car.

#### The Danish Cycle market is expected to expand
In the private sector there are 289 bicycle shops and wholesale dealers in greater Copenhagen, as well as 20 companies that design and sell bicycles, mainly the city's signature cargo bikes, such as Christiania Bikes (Boxcycles in the U.S.), Nihola and Larry vs Harry, and luxury bike brands such as Biomega and Velorbis. These firms generate 650 full-time jobs and a total estimated annual turnover of DKK 1.3 billion (US$222 million). Also, with the creation of cycle superhighways (cycling routes connecting different cities in the Zealand region with each other) and the arrival of e-bikes in the mass market (in just one year the number sold and produced went up from 2,300 in 2017 to around 3,000 in 2018, an increase of around 27 percent), the overall size of the Danish bicycle market is expected to grow over the next 5 years.

## Introduction: Business Problem

Overall, the brief introduction in the previous section laid the foundations for defining the business problem that this project investigates. In summary:

* The cycle market is closely woven into the Danish cultural fabric. Cycling is a crucial means for Danes to move within urban and rural landscapes.
* The market is forecast to expand over the coming years, owing to A) the creation of new cycling infrastructure, and B) a growing demand for e-bikes.
* The competitive landscape for bike shops offering sales & repairs within the city of Copenhagen already appears densely populated.

Thus, these considerations lead us to the overarching problem statement, namely: 

##### To capture as large a share as possible of the expanding cycle market, where should someone establish a bike shop in Copenhagen?

## Data Extraction and Wrangling

Based on the definition of our problem, the factors that will influence our decision are:
* **number of and distance** to **bike shops** in the neighborhood, if any
* **distance** of neighborhood from the **closest bike trail**
* **distance** of neighborhood from the **city center**

I decided to use all postal codes from the City of Copenhagen to define our neighborhoods.

The following data sources will be needed to extract/generate the required information:
* postal codes of the Greater Copenhagen Region have been scraped from **state registries** (can be found at: https://www.regionh.dk/english/about-the-capital-region/facts-about-the-region/PublishingImages/PostalcodesEnglish.pdf) and pre-processed to select only the areas included in the 'City of Copenhagen'
* missing addresses are obtained via the **Bing API** reverse geocoding feature
* upon analysis, candidate areas will be algorithmically defined via **gridding**
* centers of candidate areas will be generated algorithmically, and approximate addresses of those centers will be obtained using **arcgis and google geocoder APIs**
* the number of bikeshops in every neighborhood, along with their type and location, will be obtained using the **Foursquare API**, along with bike trail coordinates
* the coordinates of the Copenhagen center will be obtained using **arcgis and google geocoder APIs** for a well-known and central Copenhagen location (Kongens Nytorv square)

To start, let's import all libraries that we will be using throughout the project.

In [1]:
import random 
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline 
from sklearn.cluster import KMeans 
from sklearn.datasets import make_blobs
import html5lib
from bs4 import BeautifulSoup
import lxml
import sqlalchemy
from IPython.display import clear_output
import json 
import requests 
from pandas import json_normalize 
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium
import reverse_geocoder as rg 
from pprint import pprint 
from geopy import geocoders
print('Libraries imported.')

Libraries imported.


Then, let's import the CSV containing all postal codes of Copenhagen, store the data in a pandas dataframe, and display the first rows.

In [2]:
df = pd.read_csv("C:\\Users\\zmcd\\Desktop\\Capstone\\Copenhagen Municipalities.csv")
df.head(10)

Unnamed: 0,Postal Code,City,Street
0,877,Valby,Vigerslev Allé 18
1,900,København C,
2,910,København C,Ufrankerede svarforsendelser
3,929,København C,Ufrankerede svarforsendelser
4,999,København C,Emil Holms Kanal 20
5,1000,København K,Købmagergade 33
6,1001,København K,Postboks
7,1002,København K,Postboks
8,1003,København K,Postboks
9,1004,København K,Postboks


Apparently, a number of postal codes are associated with PO boxes ('Postboks' entries, i.e. boxes where mail is dropped). Since they are not tied to any specific street address and may lead to redundancies, let's drop them.

In [3]:
df = df[df.Street != 'Postboks']
df.head()

Unnamed: 0,Postal Code,City,Street
0,877,Valby,Vigerslev Allé 18
1,900,København C,
2,910,København C,Ufrankerede svarforsendelser
3,929,København C,Ufrankerede svarforsendelser
4,999,København C,Emil Holms Kanal 20


Getting better! However, we are not done yet, as a number of postal codes have missing entries for the address. To fix this, let's first get coordinates for each postal code via the arcgis geocoder API, and then use the reverse geocoding feature of the Bing Maps API to fill in the missing addresses. The arcgis request is created below, and the results are appended to the existing dataframe.

In [4]:
import geocoder

In [5]:
import geopy
from geopy.geocoders import Nominatim

In [6]:
def get_geocoder(Postcode):
    lat_lng_coords = None

    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Copenhagen, Denmark'.format(Postcode))
        lat_lng_coords = g.latlng
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    return latitude,longitude

i=0
lat, lon = [], []
for val in df['Postal Code']:
    lat_value, lon_value = get_geocoder(val)
    lat.append(lat_value)
    lon.append(lon_value)
    clear_output(wait=True)
    i +=1
    print(i/len(df)*100)
          
df['Latitude']=lat
df['Longitude']=lon
df1=df.copy()

100.0


In [7]:
df1.tail(20)

Unnamed: 0,Postal Code,City,Street,Latitude,Longitude
638,1964,Frederiksberg C,Ingemannsvej,55.67567,12.56756
639,1965,Frederiksberg C,Erik Menveds Vej,55.67567,12.56756
640,1966,Frederiksberg C,Steenwinkelsvej,55.67567,12.56756
641,1967,Frederiksberg C,Svanemosegårdsvej,55.67567,12.56756
642,1970,Frederiksberg C,Rosenørns Allé 1-65 + 20-70,55.68222,12.551911
643,1971,Frederiksberg C,Adolph Steens Allé,55.67567,12.56756
644,1972,Frederiksberg C,Worsaaesvej,55.67567,12.56756
645,1973,Frederiksberg C,Jakob Dannefærds Vej,55.67567,12.56756
646,1974,Frederiksberg C,Julius Thomsens Gade Ulige nr,55.682955,12.554368
647,1999,Frederiksberg C,Rosenørns Allé 22,55.67567,12.56756


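Good, we now have lat/lon coordinates for all rows. Note that many rows share the same pair (55.67567, 12.56756), which suggests the geocoder returned a generic fallback location for postal codes it could not resolve individually. We are ready to replace the missing street values with the respective addresses. For this purpose, we create a request for the Bing REST API reverse geocoder and replace the values.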

In [8]:
d_idx = df1['Street'].isnull()
coordinates = df1.loc[d_idx,['Latitude', 'Longitude']].values.tolist()
import geocoder
g = geocoder.bing(coordinates, method='batch_reverse', key='removed for sharing')

In [9]:
address_replace = pd.Series([result.address.split(",")[0] for result in g])
df1.loc[d_idx, 'Street'] = address_replace.values
df1.tail(12)

Unnamed: 0,Postal Code,City,Street,Latitude,Longitude
646,1974,Frederiksberg C,Julius Thomsens Gade Ulige nr,55.682955,12.554368
647,1999,Frederiksberg C,Rosenørns Allé 22,55.67567,12.56756
648,2000,Frederiksberg,Normasvej 31,55.6704,12.511796
649,2100,København Ø,2100 København Ø,55.705645,12.572474
650,2200,København N,Vedbækgade 12,55.696715,12.543886
651,2300,København S,Englandsvej 94,55.65277,12.601429
652,2400,København NV,Hovmestervej,55.709575,12.528592
653,2450,København SV,Händelsvej 21,55.64903,12.525542
654,2500,Valby,Høffdingsvej 5B,55.660105,12.501817
655,2620,Albertslund,Vesterbrogade 2A,55.67567,12.56756


Nice! This first dataset looks pretty much good to go. Let's save it, reload it from disk so the API calls need not be repeated, and drop rows with duplicate coordinates to ensure data consistency.

In [10]:
#df1.to_csv("C:\\Users\\zmcd\\Desktop\\Capstone\\Copenhagen_Neighbourhoods_with_coordinates.csv")

In [11]:
df = pd.read_csv("C:\\Users\\zmcd\\Desktop\\Capstone\\Copenhagen_Neighbourhoods_with_coordinates.csv")
df = df.sort_values('Postal Code').drop_duplicates(subset=['Longitude', 'Latitude'], keep='last')
df = df.reset_index(drop=True)  # the frame is already sorted by postal code
del df['Unnamed: 0']
df

Unnamed: 0,Postal Code,City,Street,Latitude,Longitude
0,1050,København K,Kongens Nytorv,55.680453,12.586210
1,1051,København K,Nyhavn,55.679770,12.592205
2,1052,København K,Herluf Trolles Gade,55.679060,12.589269
3,1053,København K,Cort Adelers Gade,55.677804,12.590820
4,1054,København K,Peder Skrams Gade,55.677623,12.589280
5,1055,København K,August Bournonvilles Passage,55.678475,12.587254
6,1056,København K,Heibergsgade,55.678980,12.587791
7,1057,København K,Holbergsgade,55.678195,12.589415
8,1058,København K,Havnegade,55.677365,12.590504
9,1059,København K,Niels Juels Gade,55.677085,12.586630


And now let's visualize the results, to get a sense of how the postal codes are distributed throughout the city of Copenhagen.

In [12]:
cph_lat= 55.680278
cph_lon= 12.569167
cph_map = folium.Map(location=[cph_lat, cph_lon], zoom_start=12)
neighbourhoods = folium.map.FeatureGroup()

for lat, lng, in zip(df.Latitude, df.Longitude):
    neighbourhoods.add_child(
        folium.features.CircleMarker(
            [lat, lng],
            radius=8, 
            color='blue',
            fill=True,
            fill_color='red',
            fill_opacity=0.8
        )
    )

cph_map.add_child(neighbourhoods)
cph_map

From the visualization, we can see that different areas of Copenhagen have different concentrations of postcodes. This could turn out to be quite troublesome when extracting locations with the Foursquare API, because the radius parameter is static: with a single radius, either venues will be missed or we will get a lot of duplicates. Thus, let's split the dataframe by the granularity of postcodes per area (inner CPH vs. outer CPH).

In [13]:
#Splitting addresses by granularity
df1 = df.iloc[:392]
df2 = df.iloc[392:]

Now let's store our credentials for Foursquare API.

In [14]:
foursquare_client_id = 'removed for sharing' 
foursquare_client_secret = 'removed for sharing' 
version = '20180605'

Also, based on the API documentation, let's store the categoryId defining bikeshops.

In [15]:
Bike_Shop_Category = '4bf58dd8d48988d115951735'

Good! Now let's start extracting locations. We will begin with the high-granularity neighbourhoods, stored in df1. First, let's create the API request.

In [16]:
LIMIT = 10000 # limit of number of venues returned by Foursquare API

radius = 15000 # define radius

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
    foursquare_client_id, 
    foursquare_client_secret, 
    version, 
    cph_lat, 
    cph_lon,
    Bike_Shop_Category,
    radius, 
    LIMIT)
results = requests.get(url).json()
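
The request URL above is assembled with string formatting. As an alternative sketch, the same query string can be built with the standard library's `urlencode`, which also escapes special characters. The helper name `build_explore_url` and the placeholder credentials are assumptions for illustration; the endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def build_explore_url(client_id, client_secret, version, lat, lng,
                      category_id, radius, limit):
    """Build the venues/explore URL with properly escaped parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "categoryId": category_id,
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


# Placeholder credentials; the other values mirror the cell above.
example_url = build_explore_url(
    "YOUR_ID", "YOUR_SECRET", "20180605",
    55.680278, 12.569167, "4bf58dd8d48988d115951735", 15000, 10000)

# Parse the URL back to confirm the parameters survive the round trip.
print(parse_qs(urlparse(example_url).query)["ll"])
```

Keeping the parameters in a dictionary also makes it easier to reuse the same builder inside a loop over many coordinates.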

In [17]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Good! Now let's extract all bikeshops contained in the high-granularity areas.

In [19]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.shape

(63, 4)

In [20]:
def getNearbyVenues(names, latitudes, longitudes, radius, Bike_Shop_Category):
    
    venues_list=[]
    cn=0
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
            foursquare_client_id, 
            foursquare_client_secret, 
            version, 
            lat, 
            lng,
            Bike_Shop_Category,
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
        cn+=1
        clear_output(wait=True)
        print(cn/len(names)*100)

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Street', 
                  'Street_lat', 
                  'Street_lon', 
                  'Venue', 
                  'Venue_Latitude', 
                  'Venue_Longitude', 
                  'Venue_Category']
 
    return(nearby_venues)

Now, since we have a lot of detail for the inner area of Copenhagen, let's set a radius of 100 metres to be sure to capture all locations while trying to minimize duplicates.

In [23]:
cph_venues_inner = getNearbyVenues(names=df1['Street'],
                             latitudes=df1['Latitude'],
                             longitudes=df1['Longitude'],
                             radius = 100,
                             Bike_Shop_Category=Bike_Shop_Category
                            )

100.0


In [24]:
print(cph_venues_inner.shape)
cph_venues_inner.head()

(172, 7)


Unnamed: 0,Street,Street_lat,Street_lon,Venue,Venue_Latitude,Venue_Longitude,Venue_Category
0,Nyhavn,55.67977,12.592205,Copenhagen Bicycles,55.679035,12.592747,Bike Shop
1,Peder Skrams Gade,55.677623,12.58928,Gammelholm Cykler,55.67818,12.588661,Bike Shop
2,August Bournonvilles Passage,55.678475,12.587254,Gammelholm Cykler,55.67818,12.588661,Bike Shop
3,Holbergsgade,55.678195,12.589415,Gammelholm Cykler,55.67818,12.588661,Bike Shop
4,Laksegade,55.6781,12.58398,Coffee and Bikes,55.67878,12.583225,Bike Shop


Good! It seems we have 172 rows identified as bikeshops within inner Copenhagen. However, it also seems we have a series of duplicates (the same shop found from several nearby streets). Let's drop them to avoid redundancy.

In [25]:
cph_venues_inner.drop_duplicates(subset='Venue', keep="first", inplace=True)
cph_venues_inner.head()

Unnamed: 0,Street,Street_lat,Street_lon,Venue,Venue_Latitude,Venue_Longitude,Venue_Category
0,Nyhavn,55.67977,12.592205,Copenhagen Bicycles,55.679035,12.592747,Bike Shop
1,Peder Skrams Gade,55.677623,12.58928,Gammelholm Cykler,55.67818,12.588661,Bike Shop
4,Laksegade,55.6781,12.58398,Coffee and Bikes,55.67878,12.583225,Bike Shop
11,Østergade,55.67952,12.5824,Rapha Cycle Club,55.680043,12.582028,Bike Shop
12,Østergade,55.67952,12.5824,jupiter ekstra,55.679931,12.581227,Bike Shop


In [26]:
cph_venues_inner.reset_index(inplace=True)

In [27]:
del cph_venues_inner['index']
cph_venues_inner.head()

Unnamed: 0,Street,Street_lat,Street_lon,Venue,Venue_Latitude,Venue_Longitude,Venue_Category
0,Nyhavn,55.67977,12.592205,Copenhagen Bicycles,55.679035,12.592747,Bike Shop
1,Peder Skrams Gade,55.677623,12.58928,Gammelholm Cykler,55.67818,12.588661,Bike Shop
2,Laksegade,55.6781,12.58398,Coffee and Bikes,55.67878,12.583225,Bike Shop
3,Østergade,55.67952,12.5824,Rapha Cycle Club,55.680043,12.582028,Bike Shop
4,Østergade,55.67952,12.5824,jupiter ekstra,55.679931,12.581227,Bike Shop


In [28]:
cph_venues_inner.shape

(83, 7)
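
Dropping duplicates on the venue name alone would also merge two distinct shops that happen to share a name, for example branches of a chain. A possible refinement, sketched here in plain Python, is to key on the name together with rounded coordinates. The function name `dedupe_venues`, the rounding precision and the second "branch" row are assumptions for illustration; the other rows come from the table above.

```python
def dedupe_venues(rows, decimals=4):
    """Keep the first row per (name, rounded lat, rounded lng) key.

    rows: iterable of (venue_name, latitude, longitude) tuples.
    Four decimals corresponds to roughly 10 m in latitude.
    """
    seen = set()
    unique = []
    for name, lat, lng in rows:
        key = (name.strip().lower(), round(lat, decimals), round(lng, decimals))
        if key not in seen:
            seen.add(key)
            unique.append((name, lat, lng))
    return unique


rows = [
    ("Copenhagen Bicycles", 55.679035, 12.592747),
    ("Gammelholm Cykler", 55.67818, 12.588661),
    ("Gammelholm Cykler", 55.67818, 12.588661),  # same shop, seen twice
    ("Gammelholm Cykler", 55.70000, 12.55000),   # hypothetical second branch
]
for row in dedupe_venues(rows):
    print(row)
```

With this key the repeated row is removed while the hypothetical second branch is kept as a separate shop.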

Nice! We have found 83 unique bikeshops within inner Copenhagen. Now let's repeat the process for the outer Copenhagen area. This time, though, to be sure to capture all locations, we will increase the radius to 3000m, as each area now has only one postcode.

In [29]:
cph_venues_outer = getNearbyVenues(names=df2['Street'],
                             latitudes=df2['Latitude'],
                             longitudes=df2['Longitude'],
                             radius = 3000,
                             Bike_Shop_Category=Bike_Shop_Category
                            )

100.0


In [30]:
print(cph_venues_outer.shape)
cph_venues_outer.head()

(252, 7)


Unnamed: 0,Street,Street_lat,Street_lon,Venue,Venue_Latitude,Venue_Longitude,Venue_Category
0,Normasvej 31,55.6704,12.511796,Recycles,55.667021,12.54849,Bike Shop
1,Normasvej 31,55.6704,12.511796,A-H Cykler,55.669992,12.556318,Bike Shop
2,Normasvej 31,55.6704,12.511796,Rent a Bike CPH,55.671985,12.556067,Bike Shop
3,Normasvej 31,55.6704,12.511796,Fri BikeShop,55.666049,12.512653,Bike Shop
4,Normasvej 31,55.6704,12.511796,Sorico Cykler,55.665262,12.516797,Bike Shop


In [31]:
cph_venues_outer.drop_duplicates(subset='Venue', keep="first", inplace=True)
cph_venues_outer.head()

Unnamed: 0,Street,Street_lat,Street_lon,Venue,Venue_Latitude,Venue_Longitude,Venue_Category
0,Normasvej 31,55.6704,12.511796,Recycles,55.667021,12.54849,Bike Shop
1,Normasvej 31,55.6704,12.511796,A-H Cykler,55.669992,12.556318,Bike Shop
2,Normasvej 31,55.6704,12.511796,Rent a Bike CPH,55.671985,12.556067,Bike Shop
3,Normasvej 31,55.6704,12.511796,Fri BikeShop,55.666049,12.512653,Bike Shop
4,Normasvej 31,55.6704,12.511796,Sorico Cykler,55.665262,12.516797,Bike Shop


In [32]:
cph_venues_outer.reset_index(inplace=True)


In [33]:
del cph_venues_outer['index']
cph_venues_outer.head()

Unnamed: 0,Street,Street_lat,Street_lon,Venue,Venue_Latitude,Venue_Longitude,Venue_Category
0,Normasvej 31,55.6704,12.511796,Recycles,55.667021,12.54849,Bike Shop
1,Normasvej 31,55.6704,12.511796,A-H Cykler,55.669992,12.556318,Bike Shop
2,Normasvej 31,55.6704,12.511796,Rent a Bike CPH,55.671985,12.556067,Bike Shop
3,Normasvej 31,55.6704,12.511796,Fri BikeShop,55.666049,12.512653,Bike Shop
4,Normasvej 31,55.6704,12.511796,Sorico Cykler,55.665262,12.516797,Bike Shop


In [34]:
cph_venues_outer.shape

(86, 7)

Nice! We have found 86 more bikestores. Let's append them, keeping in mind that there may be duplicates. Let's drop them and keep the original values from inner Copenhagen, as that search used a 30x smaller radius per address (100m vs. 3000m) and so ties each shop more closely to its street.

In [35]:
cph_venues = pd.concat([cph_venues_inner, cph_venues_outer])  # DataFrame.append was removed in pandas 2.0
cph_venues.drop_duplicates(subset='Venue', keep="first", inplace=True)
cph_venues.shape

(142, 7)

So, we had 83+86-142=27 duplicates. Now we have a final list of 142 unique bikeshops. Let's clean up the dataframe and save it as a csv.
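
The number of duplicates follows from the three dataframe shapes printed above: 83 unique inner shops, 86 unique outer shops and 142 unique shops after merging. A short sketch of the arithmetic, with a toy set example (the shop names are made up) showing why the formula holds:

```python
# Row counts taken from the shapes printed above.
inner_unique = 83
outer_unique = 86
combined_unique = 142

# Shops found in both searches = |inner| + |outer| - |inner or outer|
overlap = inner_unique + outer_unique - combined_unique
print(overlap)

# Toy illustration of the same identity with made-up names.
inner_names = {"Shop A", "Shop B", "Shop C"}
outer_names = {"Shop C", "Shop D"}
toy_overlap = len(inner_names) + len(outer_names) - len(inner_names | outer_names)
print(toy_overlap == len(inner_names & outer_names))
```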

In [36]:
cph_venues.reset_index(inplace=True)
del cph_venues['index']
cph_venues.head()

Unnamed: 0,Street,Street_lat,Street_lon,Venue,Venue_Latitude,Venue_Longitude,Venue_Category
0,Nyhavn,55.67977,12.592205,Copenhagen Bicycles,55.679035,12.592747,Bike Shop
1,Peder Skrams Gade,55.677623,12.58928,Gammelholm Cykler,55.67818,12.588661,Bike Shop
2,Laksegade,55.6781,12.58398,Coffee and Bikes,55.67878,12.583225,Bike Shop
3,Østergade,55.67952,12.5824,Rapha Cycle Club,55.680043,12.582028,Bike Shop
4,Østergade,55.67952,12.5824,jupiter ekstra,55.679931,12.581227,Bike Shop


In [37]:
#cph_venues.to_csv("C:\\Users\\zmcd\\Desktop\\Capstone\\Copenhagen_Bikeshops.csv")

Now let's plot the dataframe we obtained to see how bikestores are distributed around Copenhagen!

In [38]:
bikeshops=pd.read_csv("C:\\Users\\zmcd\\Desktop\\Capstone\\Copenhagen_Bikeshops.csv")
bikeshops.drop(columns='Unnamed: 0', inplace=True)
bikeshops.head()

Unnamed: 0,Street,Street_lat,Street_lon,Venue,Venue_Latitude,Venue_Longitude,Venue_Category
0,Nyhavn,55.67977,12.592205,Copenhagen Bicycles,55.679035,12.592747,Bike Shop
1,Peder Skrams Gade,55.677623,12.58928,Gammelholm Cykler,55.67818,12.588661,Bike Shop
2,Laksegade,55.6781,12.58398,Coffee and Bikes,55.67878,12.583225,Bike Shop
3,Østergade,55.67952,12.5824,Rapha Cycle Club,55.680043,12.582028,Bike Shop
4,Østergade,55.67952,12.5824,jupiter ekstra,55.679931,12.581227,Bike Shop


In [39]:
cph_lat= 55.680278
cph_lon= 12.569167
cph_map = folium.Map(location=[cph_lat, cph_lon], zoom_start=12)
neighbourhoods = folium.map.FeatureGroup()

for lat, lng, in zip(bikeshops.Venue_Latitude, bikeshops.Venue_Longitude):
    neighbourhoods.add_child(
        folium.features.CircleMarker(
            [lat, lng],
            radius=8, 
            color='blue',
            fill=True,
            fill_color='red',
            fill_opacity=0.8
        )
    )

cph_map.add_child(neighbourhoods)
cph_map

Looking good. So now we have all the bikeshops in the Greater Copenhagen Area! 

This concludes the data gathering phase - we're now ready to use this data for analysis to produce the report on optimal locations for a new bikeshop in Copenhagen!

## Methodology

In this project we will direct our efforts towards detecting areas of Copenhagen that have low bikeshop density.

In the first step we collected the required data. 

The second step in our analysis will be the calculation and exploration of '**bikeshop density**' across different areas of Copenhagen - we will use **heatmaps** to identify a few promising areas close to the center with a low number of bikeshops, and focus our attention on those areas.

In the third and final step we will focus on the most promising areas and within those create **clusters of locations that meet some basic requirements** established in discussion with stakeholders: we will take into consideration locations **with no competing bikeshop within a radius of 250 metres**, and among the candidate addresses we will select the one **closest to one of Copenhagen's major bike trails**. We will present a map of all such locations and also create clusters (using **k-means clustering**) of those locations to identify general zones / neighborhoods / addresses which should be the starting point for the final 'street level' exploration and search for an optimal venue location by stakeholders.
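
To make the clustering step more concrete, here is a minimal pure-Python sketch of the k-means idea (Lloyd's algorithm): assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat. It is an illustration only and not the scikit-learn `KMeans` implementation imported earlier; the helper names and the toy coordinates (in metres) are made up.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(
        range(len(centroids)),
        key=lambda c: math.hypot(point[0] - centroids[c][0],
                                 point[1] - centroids[c][1]),
    )


def kmeans(points, centroids, iterations=10):
    """Run a fixed number of Lloyd iterations from the given start centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: label every point with its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Two well-separated toy groups of candidate locations.
points = [(0, 0), (0, 100), (100, 0), (1000, 1000), (1000, 1100), (1100, 1000)]
centers, labels = kmeans(points, centroids=[(0, 0), (1000, 1000)])
print(labels)
print(centers)
```

Each resulting centroid can be read as the middle of a zone of suitable locations, which is how the cluster centers are meant to be used as starting points for a street-level search.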

## Analysis

First, let's create a map showing a heatmap / density of bikeshops and try to extract some meaningful information from it. Also, let's show the borders of Copenhagen's boroughs on our map, along with a few circles indicating distances of 1km, 2km and 3km from Kongens Nytorv.

In [40]:
copenhagen_boroughs_url = 'https://raw.githubusercontent.com/codeforamerica/click_that_hood/master/public/data/copenhagen.geojson'
copenhagen_boroughs = requests.get(copenhagen_boroughs_url).json()

def boroughs_style(feature):
    return { 'color': 'black', 'fill': False }

In [41]:
bikeshops_latlons = [[res[4], res[5]] for res in bikeshops.values]

In [42]:
from folium import plugins
from folium.plugins import HeatMap
cph_center = [cph_lat, cph_lon]
cph_map = folium.Map(location=[cph_lat, cph_lon], zoom_start=13)

folium.TileLayer('cartodbpositron').add_to(cph_map) #cartodbpositron cartodbdark_matter
HeatMap(bikeshops_latlons).add_to(cph_map)
folium.Marker(cph_center).add_to(cph_map)
folium.Circle(cph_center, radius=1000, fill=False, color='grey').add_to(cph_map)
folium.Circle(cph_center, radius=2000, fill=False, color='grey').add_to(cph_map)
folium.Circle(cph_center, radius=3000, fill=False, color='grey').add_to(cph_map)
folium.GeoJson(copenhagen_boroughs, style_function=boroughs_style, name='geojson').add_to(cph_map)
cph_map

From the heatmap, it seems that most bikeshops in Copenhagen are concentrated in the immediate center of the city (Copenhagen K) and in the upper part of the neighbouring district to the south-west (Copenhagen V, or Vesterbro). Interestingly, this leaves some competitive space for opening new stores within some of the other neighbouring districts. In particular, Frederiksberg and Nørrebro have a considerable amount of space within a 2km radius of Kongens Nytorv, and are also among the most densely populated areas of the Greater Copenhagen zone.

### Frederiksberg

Frederiksberg is an affluent area, with large parks such as Søndermarken and Frederiksberg Have, as well as a number of educational institutions such as Copenhagen Business School (CBS), Technical Education Copenhagen (TEC), the University of Copenhagen, and the Royal Danish Academy of Music.
Furthermore, there are extensive and vibrant cultural attractions in Frederiksberg, with theatre, events and concert venues such as Aveny-T, Riddersalen, Betty Nansen Teatret, Cisternerne, Forum and KU.BE.

All in all, Frederiksberg is characterized by a vibrant student community, constituting a significant portion of the 70% of the Frederiksberg population that every day commutes to either work or education.

### Nørrebro

Nørrebro is a hip, multicultural neighborhood, popular with students and creative types. Kebab joints and indie shops line the main road, Nørrebrogade, and late-night bars are tucked into the side streets. Foodies head to the high-end eateries and trendy coffee spots on Jægersborggade. Nearby, the leafy paths of Assistens Cemetery wind past the graves of such notables as Hans Christian Andersen and Søren Kierkegaard.

In 2016, Copenhagen had 13,100 more bikes than cars and it is said Nørrebrogade is the busiest cycling street in Europe.

### So what?

Popular with tourists, relatively close to city center and well connected for bikers, those boroughs appear to justify further analysis.

Let's define a new, narrower region of interest, which will include the low-bikeshop-count parts of Frederiksberg and Nørrebro closest to Kongens Nytorv.

To do this, we will first need to define functions for converting lat/lon coordinates to X/Y cartesian coordinates, as we will later need them to create the grid that segments the candidate neighbourhoods.

In [43]:
import pyproj

import math

def lonlat_to_xy(lon, lat):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=33, datum='WGS84')
    xy = pyproj.transform(proj_latlon, proj_xy, lon, lat)
    return xy[0], xy[1]

def xy_to_lonlat(x, y):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=33, datum='WGS84')
    lonlat = pyproj.transform(proj_xy, proj_latlon, x, y)
    return lonlat[0], lonlat[1]

def calc_xy_distance(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)

print('Coordinate transformation check')
print('-------------------------------')
print('CPH center longitude={}, latitude={}'.format(cph_center[1], cph_center[0]))
x, y = lonlat_to_xy(cph_center[1], cph_center[0])
print('CPH center UTM X={}, Y={}'.format(x, y))
lo, la = xy_to_lonlat(x, y)
print('CPH center longitude={}, latitude={}'.format(lo, la))

Coordinate transformation check
-------------------------------
CPH center longitude=12.569167, latitude=55.680278
CPH center UTM X=347161.7414550098, Y=6173174.698220171
CPH center longitude=12.569167, latitude=55.680278
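
As an independent sanity check on distances computed from the projected X/Y coordinates, the great-circle (haversine) distance can be computed directly from latitude and longitude. Over a few hundred metres the two should agree closely. The sketch below uses only the standard library; the function name `haversine_m` and the spherical Earth radius of 6,371 km are assumptions, and the two points are the Kongens Nytorv and Nyhavn rows from the postal code table earlier in the notebook.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2, earth_radius_m=6371000.0):
    """Great-circle distance in metres between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * earth_radius_m * math.asin(math.sqrt(a))


# Kongens Nytorv and Nyhavn, as listed in the postal code dataframe.
d = haversine_m(55.680453, 12.586210, 55.679770, 12.592205)
print(round(d), "metres")
```

Comparing this value with `calc_xy_distance` applied to the projected coordinates of the same two points is a quick way to confirm that the longitude/latitude arguments were passed in the right order.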


The functions have been created. Now let's convert the coordinates of Kongens Nytorv, and narrow down the region of interest (roi) based on the considerations listed above. More specifically, let's focus on the area where the chosen districts, Frederiksberg and Nørrebro, meet.

In [44]:
cph_center_x, cph_center_y = lonlat_to_xy(cph_center[1], cph_center[0])

In [45]:
roi_x_min = cph_center_x + 2000
roi_y_max = cph_center_y - 1000
roi_width = 4000
roi_height = 4000
roi_center_x = roi_x_min - 3800
roi_center_y = roi_y_max + 1800
roi_center_lon, roi_center_lat = xy_to_lonlat(roi_center_x, roi_center_y)
roi_center = [roi_center_lat, roi_center_lon]

cph_map = folium.Map(location=roi_center, zoom_start=13)
HeatMap(bikeshops_latlons).add_to(cph_map)
folium.Marker(cph_center).add_to(cph_map)
folium.Circle(roi_center, radius=1000, color='blue', fill=True, fill_opacity=0.4).add_to(cph_map)
folium.Circle(cph_center, radius=1000, fill=False, color='grey').add_to(cph_map)
folium.Circle(cph_center, radius=2000, fill=False, color='grey').add_to(cph_map)
folium.Circle(cph_center, radius=3000, fill=False, color='grey').add_to(cph_map)
folium.GeoJson(copenhagen_boroughs, style_function=boroughs_style, name='geojson').add_to(cph_map)
cph_map
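
The ROI center is derived from the city center in two steps, which makes the net shift hard to see at a glance. The short sketch below simply combines the offsets from the cell above (in metres, assuming UTM X grows eastward and Y northward); the variable names are illustrative.

```python
import math

# Offsets copied from the cell above.
offset_x = 2000 - 3800    # roi_x_min shift, then roi_center_x shift
offset_y = -1000 + 1800   # roi_y_max shift, then roi_center_y shift

distance = math.hypot(offset_x, offset_y)
print(offset_x, offset_y, round(distance))
```

In other words, the ROI center sits 1800 m west and 800 m north of the map center, a little under 2 km away, and the blue circle of radius 1000 m is drawn around that point.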

Nice. We have now defined the boundaries of the area from which candidate addresses will be selected. To be able to look into individual addresses in detail, let's break the area down into smaller sub-sections that are easier to process. To do this, we will generate a hexagonal grid that covers the chosen area with equally sized circular cells whose centers are 100 metres apart (i.e. 50 metres in radius).

In [46]:
k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells
x_step = 100
y_step = 100 * k 
roi_y_min = roi_center_y - 2000
roi_x_min = roi_center_x - 2000
roi_y_max = roi_center_y + 2000
roi_x_max = roi_center_x + 2000
roi_latitudes = []
roi_longitudes = []
roi_xs = []
roi_ys = []
for i in range(0, int(51/k)):
    y = roi_y_min + i * y_step
    x_offset = 50 if i%2==0 else 0
    for j in range(0, 51):
        x = roi_x_min + j * x_step + x_offset
        d = calc_xy_distance(roi_center_x, roi_center_y, x, y)
        if (d <= 1001):
            lon, lat = xy_to_lonlat(x, y)
            roi_latitudes.append(lat)
            roi_longitudes.append(lon)
            roi_xs.append(x)
            roi_ys.append(y)

print(len(roi_longitudes), 'candidate neighborhood centers generated.')

365 candidate neighborhood centers generated.
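
The grid logic above can also be written as a small reusable function. The sketch below is a variant, not the cell's exact code: it builds the rows symmetrically around the center and uses a strict 1000 m cut-off, so its count may differ slightly from the 365 reported above. The name `hex_grid_centers` and its parameters are assumptions for illustration.

```python
import math


def hex_grid_centers(center_x, center_y, spacing=100.0, radius=1000.0):
    """Centers of a hexagonal grid, `spacing` apart, inside a circle."""
    row_step = spacing * math.sqrt(3) / 2  # vertical distance between rows
    n = int(math.ceil(radius / row_step)) + 1
    centers = []
    for i in range(-n, n + 1):
        y = center_y + i * row_step
        # Every other row is shifted by half a cell to form the hex pattern.
        x_offset = spacing / 2 if i % 2 else 0.0
        for j in range(-n - 1, n + 2):
            x = center_x + j * spacing + x_offset
            if math.hypot(x - center_x, y - center_y) <= radius:
                centers.append((x, y))
    return centers


grid = hex_grid_centers(0.0, 0.0)
# Rough expectation: circle area divided by the area served by one center.
expected = math.pi * 1000.0 ** 2 / (100.0 * 100.0 * math.sqrt(3) / 2)
print(len(grid), "centers; area-based estimate:", round(expected))
```

The half-cell shift on alternate rows is what makes every center exactly 100 m from each of its six neighbours, which a plain square grid would not achieve.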


In [47]:
cph_map = folium.Map(location=cph_center, zoom_start=13)
folium.Marker(cph_center, popup='Kongens Nytorv').add_to(cph_map)
for lat, lon in zip(roi_latitudes, roi_longitudes):
    folium.Circle([lat, lon], radius=50, color='blue', fill=False).add_to(cph_map)
folium.Circle(cph_center, radius=1000, fill=False, color='grey').add_to(cph_map)
folium.Circle(cph_center, radius=2000, fill=False, color='grey').add_to(cph_map)
folium.Circle(cph_center, radius=3000, fill=False, color='grey').add_to(cph_map)
cph_map

Here is a visualization of how the 365 candidate sub-areas make up the overall selected area. The next step will be to filter out those locations with bikeshops present within a 250 metre radius. To do this, we will once again convert the bikeshop coordinates to cartesian, store them in a list of [x, y] pairs, and iterate over that list to find the candidate locations that are free from competition.

In [48]:
import utm
xs = []
ys = []

latitudes = bikeshops['Venue_Latitude']
longitudes = bikeshops['Venue_Longitude']

for lat, lon in zip(latitudes, longitudes):
    x, y = lonlat_to_xy(lon, lat)
    xs.append(x)
    ys.append(y)

In [49]:
bs_xy = pd.DataFrame(
    {'Bikeshop_XY_Latitude': xs,
     'Bikeshop_XY_Longitude': ys,
    })
bs_xy.head()
bs_dict = bs_xy.to_dict('split')
bs_dict = bs_dict['data']
bs_dict

[[348639.1891295843, 6172984.691885383],
 [348379.07073225186, 6172898.506436694],
 [348039.63526239037, 6172977.143842899],
 [347969.283975851, 6173120.3313958645],
 [347918.47244012996, 6173109.567633053],
 [347653.5145729254, 6173454.898373645],
 [347469.0431928887, 6173504.903877762],
 [347603.532297622, 6173152.102005131],
 [347232.9353438861, 6172985.869698325],
 [347299.790636913, 6173330.130865379],
 [347291.73462555057, 6173351.489803018],
 [348277.3631737734, 6173260.865123768],
 [348353.79591978196, 6173614.718403587],
 [348250.2284073555, 6173658.165285288],
 [348329.4817325276, 6173548.953005342],
 [348050.49512968125, 6173503.23864341],
 [347998.19132097566, 6173605.178060925],
 [347975.2417679616, 6173852.135542348],
 [347953.26980848773, 6173922.18716574],
 [347075.078431475, 6173497.081862704],
 [347083.5582312799, 6173503.416166067],
 [347146.63387260464, 6173729.179612192],
 [347004.54222162627, 6173789.423146185],
 [346996.5511747622, 6173809.436203252],
 [346974.21

In [50]:
def count_bikeshops_nearby(x, y, bs_dict, radius=250):    
    count = 0
    for res in bs_dict:
        res_x = res[0]; res_y = res[1]
        d = calc_xy_distance(x, y, res_x, res_y)
        if d<=radius:
            count += 1
    return count

roi_bikeshop_counts = []


print('Generating data on location candidates... ', end='')
for x, y in zip(roi_xs, roi_ys):
    count = count_bikeshops_nearby(x, y, bs_dict, radius=250)
    roi_bikeshop_counts.append(count)
print('done.')

Generating data on location candidates... done.
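To make the filtering logic concrete, here is a minimal sketch on toy data. The coordinates are hypothetical projected values in metres, not taken from the dataset, and `count_within` is an illustrative stand-in for `count_bikeshops_nearby`.

```python
import math

def count_within(x, y, points, radius=250.0):
    """Count how many (px, py) points lie within `radius` metres of (x, y)."""
    return sum(1 for px, py in points if math.hypot(px - x, py - y) <= radius)

# Hypothetical projected coordinates in metres
shops = [(100.0, 0.0), (0.0, 240.0), (400.0, 400.0)]
candidates = [(0.0, 0.0), (1000.0, 1000.0)]

counts = [count_within(x, y, shops) for x, y in candidates]
free = [c for c, n in zip(candidates, counts) if n == 0]

print(counts)  # [2, 0]
print(free)    # [(1000.0, 1000.0)]
```

The first candidate has two shops within 250 metres and is discarded; the second has none and is kept.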


In [51]:
bikeshops_roi_locations = pd.DataFrame({'Latitude':roi_latitudes,
                                 'Longitude':roi_longitudes,
                                 'X':roi_xs,
                                 'Y':roi_ys,
                                 'Bikeshops nearby':roi_bikeshop_counts})

bikeshops_roi_locations.head()

Unnamed: 0,Latitude,Longitude,X,Y,Bikeshops nearby
0,55.678185,12.536687,345111.741455,6173014.0,0
1,55.678217,12.538276,345211.741455,6173014.0,1
2,55.678249,12.539864,345311.741455,6173014.0,1
3,55.678281,12.541453,345411.741455,6173014.0,1
4,55.678313,12.543042,345511.741455,6173014.0,0


As a result of this process, we now have for each address a count of bikeshops within a 250 metres radius. As previously mentioned, we want no nearby competition! Thus, let's only keep those addresses that have 0 nearby bikeshops.

In [52]:
good_bs_count = np.array((bikeshops_roi_locations['Bikeshops nearby']==0))
print('Locations with no bikeshops nearby:', good_bs_count.sum())

df_good_locations = bikeshops_roi_locations[good_bs_count]
df_good_locations.head()

Locations with no bikeshops nearby: 247


Unnamed: 0,Latitude,Longitude,X,Y,Bikeshops nearby
0,55.678185,12.536687,345111.741455,6173014.0,0
4,55.678313,12.543042,345511.741455,6173014.0,0
5,55.678344,12.544631,345611.741455,6173014.0,0
6,55.678915,12.534255,344961.741455,6173101.0,0
7,55.678947,12.535844,345061.741455,6173101.0,0


Good! 118 candidate locations had at least one bikeshop within 250 metres, which leaves us with 247. Let's visualize the results, first in comparison to existing competition, and then through a heatmap showing the concentration of candidate addresses over the territory.

In [53]:
good_latitudes = df_good_locations['Latitude'].values
good_longitudes = df_good_locations['Longitude'].values

good_locations = [[lat, lon] for lat, lon in zip(good_latitudes, good_longitudes)]

cph_map = folium.Map(location=roi_center, zoom_start=13)
folium.TileLayer('cartodbpositron').add_to(cph_map)
HeatMap(bikeshops_latlons).add_to(cph_map)
folium.Circle(roi_center, radius=2000, color='blue', fill=True, fill_opacity=0.05).add_to(cph_map)
folium.Marker(cph_center).add_to(cph_map)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(cph_map) 
folium.GeoJson(copenhagen_boroughs, style_function=boroughs_style, name='geojson').add_to(cph_map)
cph_map

In [54]:
cph_map = folium.Map(location=roi_center, zoom_start=13)
HeatMap(good_locations, radius=25).add_to(cph_map)
folium.Marker(cph_center).add_to(cph_map)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(cph_map)
folium.GeoJson(copenhagen_boroughs, style_function=boroughs_style, name='geojson').add_to(cph_map)
cph_map

We now have a more precise idea of how the remaining candidate locations are distributed. Let's use the KMeans algorithm to group them into 15 clusters, so as to narrow our selection down further.

In [55]:
from sklearn.cluster import KMeans

number_of_clusters = 15

good_xys = df_good_locations[['X', 'Y']].values
kmeans = KMeans(n_clusters=number_of_clusters, random_state=0).fit(good_xys)

cluster_centers = [xy_to_lonlat(cc[0], cc[1]) for cc in kmeans.cluster_centers_]

cph_map = folium.Map(location=roi_center, zoom_start=13)
folium.TileLayer('cartodbpositron').add_to(cph_map)
HeatMap(bikeshops_latlons).add_to(cph_map)
folium.Circle(roi_center, radius=2000, color='blue', fill=True, fill_opacity=0.1).add_to(cph_map)
folium.Marker(cph_center).add_to(cph_map)
for lon, lat in cluster_centers:
    folium.Circle([lat, lon], radius=250, color='green', fill=True, fill_opacity=0.25).add_to(cph_map) 
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(cph_map)
folium.GeoJson(copenhagen_boroughs, style_function=boroughs_style, name='geojson').add_to(cph_map)
cph_map
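For readers who want to see what the clustering step does under the hood, here is a minimal pure-Python sketch of Lloyd's algorithm, the iteration behind k-means. It is not scikit-learn's implementation: it assumes a simple deterministic farthest-point seeding (scikit-learn uses k-means++ by default), and the function name and toy points are hypothetical.

```python
import math

def kmeans(points, k, iterations=100):
    """Plain Lloyd's algorithm on 2-D points; returns (centers, labels)."""
    # Deterministic seeding: start from the first point, then repeatedly
    # add the point that is farthest from all centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append((sum(x for x, _ in members) / len(members),
                                    sum(y for _, y in members) / len(members)))
            else:
                new_centers.append(centers[c])  # leave an empty cluster's center in place
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, labels

# Two hypothetical groups of candidate points (metres)
toy = [(0.0, 0.0), (0.0, 100.0), (100.0, 0.0), (100.0, 100.0),
       (5000.0, 5000.0), (5000.0, 5100.0), (5100.0, 5000.0), (5100.0, 5100.0)]
centers, labels = kmeans(toy, k=2)
print(centers)  # [(50.0, 50.0), (5050.0, 5050.0)]
```

Each cluster center is the mean of its member points, so it does not necessarily coincide with one of the original grid locations.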

Now let's zoom in a bit to get a better idea.

In [56]:
cph_map = folium.Map(location=roi_center, zoom_start=14)
folium.Marker(cph_center).add_to(cph_map)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.Circle([lat, lon], radius=225, color='#00000000', fill=True, fill_color='#0066ff', fill_opacity=0.07).add_to(cph_map)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(cph_map)
for lon, lat in cluster_centers:
    folium.Circle([lat, lon], radius=225, color='green', fill=False).add_to(cph_map) 
folium.GeoJson(copenhagen_boroughs, style_function=boroughs_style, name='geojson').add_to(cph_map)
cph_map

We have our clusters! However, 247 addresses would not provide a very sharp recommendation. Thus, we are going to pick the addresses that correspond to the centers of our 15 clusters, which the KMeans algorithm has already spread out across the area. To do this, we will use the Google Maps Geocoding API to reverse geocode the coordinates of the cluster centers.

In [57]:
api_key='removed for sharing'

def get_address(api_key, latitude, longitude, verbose=False):
    """Reverse geocode a lat/lon pair; return the first formatted address, or None on failure."""
    try:
        url = 'https://maps.googleapis.com/maps/api/geocode/json?key={}&latlng={},{}'.format(api_key, latitude, longitude)
        response = requests.get(url).json()
        if verbose:
            print('Google Maps API JSON result =>', response)
        results = response['results']
        address = results[0]['formatted_address']
        return address
    except Exception:
        # network errors, malformed JSON, or an empty result list
        return None

candidate_area_addresses = []
for lon, lat in cluster_centers:
   addr = get_address(api_key, lat, lon)
   candidate_area_addresses.append(addr)    
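The request and the parsing can be separated so that the parsing is testable without a network call. This is a sketch under assumptions: the helper names are hypothetical, and the sample dictionaries are canned stand-ins for the decoded JSON (the `status`, `results` and `formatted_address` fields follow the Geocoding API response format).

```python
from urllib.parse import urlencode

GEOCODE_URL = 'https://maps.googleapis.com/maps/api/geocode/json'

def build_reverse_geocode_url(api_key, latitude, longitude):
    """Compose the request URL; urlencode escapes the parameters safely."""
    query = urlencode({'latlng': f'{latitude},{longitude}', 'key': api_key})
    return f'{GEOCODE_URL}?{query}'

def first_formatted_address(response):
    """Return the first formatted address from a decoded JSON response, or None."""
    if response.get('status') != 'OK':
        return None
    results = response.get('results') or []
    return results[0].get('formatted_address') if results else None

# Canned responses used for illustration only
sample_ok = {'status': 'OK',
             'results': [{'formatted_address': 'Nyelandsvej 25B, 2000 Frederiksberg, Denmark'}]}
sample_empty = {'status': 'ZERO_RESULTS', 'results': []}

print(first_formatted_address(sample_ok))
print(first_formatted_address(sample_empty))
```

Checking `status` explicitly distinguishes "no address found" from a genuine request error, which a bare `except` would hide.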

In [58]:
candidate_area_addresses

['L.I. Brandes Allé 10, 1956 Frederiksberg, Denmark',
 'Lundtoftegade 42, 2200 København, Denmark',
 'Guldborgvej 21, 2000 Frederiksberg, Denmark',
 'Kapelvej 4, 2200 København, Denmark',
 'Nordre Sti 4, 1870 Frederiksberg, Denmark',
 'Nyelandsvej 25B, 2000 Frederiksberg, Denmark',
 'Dronning Olgas Vej 30, 2000 Frederiksberg, Denmark',
 'Husumgade 29, 2200 København, Denmark',
 'Duevej 22, 2000 Frederiksberg, Denmark',
 'Bille Brahes Vej 10, 1963 Frederiksberg, Denmark',
 'Hans Tavsens Gade 40, 2200 København, Denmark',
 'Aksel Møllers Have 7, 2000 Frederiksberg, Denmark',
 'Rolighedsvej 903, 1958 Frederiksberg, Denmark',
 'Stefansgade 73, 2200 København, Denmark',
 'Rathsacksvej 14, 1862 Frederiksberg, Denmark']

In [59]:
results = pd.DataFrame(candidate_area_addresses, columns=['Best_addresses_where_to_open_a_bikeshop'])
results= pd.DataFrame(results.Best_addresses_where_to_open_a_bikeshop.str.split(',').tolist(),
                                  columns = ['Addresses','Postcode','State'])
del results['State']
results

Unnamed: 0,Addresses,Postcode
0,L.I. Brandes Allé 10,1956 Frederiksberg
1,Lundtoftegade 42,2200 København
2,Guldborgvej 21,2000 Frederiksberg
3,Kapelvej 4,2200 København
4,Nordre Sti 4,1870 Frederiksberg
5,Nyelandsvej 25B,2000 Frederiksberg
6,Dronning Olgas Vej 30,2000 Frederiksberg
7,Husumgade 29,2200 København
8,Duevej 22,2000 Frederiksberg
9,Bille Brahes Vej 10,1963 Frederiksberg
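Splitting on commas works as long as every address has exactly three comma-separated parts. A slightly more defensive variant, sketched here with a hypothetical `split_address` helper, pads missing parts and also separates the postcode from the city.

```python
def split_address(formatted):
    """Split 'Street, Postcode City, Country' into a dict; missing parts become None."""
    parts = [part.strip() for part in formatted.split(',')]
    parts += [None] * (3 - len(parts))  # pad addresses that have fewer than three parts
    street, postcode_city, country = parts[0], parts[1], parts[2]
    postcode, city = None, None
    if postcode_city:
        # the postcode is the first token, the city is whatever follows
        postcode, _, city = postcode_city.partition(' ')
    return {'street': street, 'postcode': postcode, 'city': city or None, 'country': country}

print(split_address('Nyelandsvej 25B, 2000 Frederiksberg, Denmark'))
```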


It seems like we have some winners! However, we have not yet factored in a key aspect of our analysis: the proximity to bike trails. Indeed, we want locations that, as far as possible, act as a nexus for bike traffic. For this reason, we will extract the coordinates of the main bike trails in the Greater Copenhagen area from the Foursquare API, and use them to narrow down our candidate addresses to the one closest to a major bike trail.

Again, let's form the API request, and let's get the data.

In [60]:
results = pd.read_csv("C:\\Users\\zmcd\\Desktop\\Capstone\\WinningLocations.csv")

In [61]:
Bike_Trail_Category = '56aa371be4b08b9a8d57355e'
def getNearbyVenues(names, latitudes, longitudes, radius, Bike_Trail_Category):
    
    venues_list=[]
    cn=0
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
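        # create the API request URL
        # NOTE: the query is centred on cph_lat/cph_lon rather than on each street's lat/lng,
        # so every iteration returns the same trails (duplicates are dropped further down)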
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
            foursquare_client_id, 
            foursquare_client_secret, 
            version, 
            cph_lat, 
            cph_lon,
            Bike_Trail_Category,
            radius, 
            LIMIT)
            
        # make the GET request
        result = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in result])
        cn+=1
        clear_output(wait=True)
        print(cn/len(names)*100)

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Street', 
                  'Street_lat', 
                  'Street_lon', 
                  'Venue', 
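                  'Venue_Longitude',  # NOTE: label is swapped; this column holds the venue latitude
                  'Venue_Latitude',   # NOTE: label is swapped; this column holds the venue longitude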
                  'Venue_Category']
 
    return(nearby_venues)

In [62]:
biketrails = getNearbyVenues(names=df2['Street'],
                            latitudes=df2['Latitude'],
                           longitudes=df2['Longitude'],
                            radius = 15000,
                            Bike_Trail_Category=Bike_Trail_Category
                           )

100.0


Now let's remove duplicates to ensure data consistency. Also, for some reason, the API has included a cemetery as part of the results to the biketrails request. Let's get rid of that value as well.

In [63]:
biketrails = biketrails.drop(biketrails.columns[[0, 1, 2]], axis=1)
biketrails.drop_duplicates(subset='Venue', inplace=True)
biketrails = biketrails[biketrails.Venue_Category != 'Cemetery']
biketrails

Unnamed: 0,Venue,Venue_Longitude,Venue_Latitude,Venue_Category
1,Lille Langebro,55.670966,12.579872,Bike Trail
2,Cykelsupersti C77,55.66113,12.516666,Bike Trail
3,Byskoven Cykelsti,55.63769,12.54768,Bike Trail
4,Cykeludfordringsbanen Lejren,55.683793,12.433432,Bike Trail
5,Det røde spor - Hareskov,55.766822,12.432131,Bike Trail
6,Vallensbæk Bikepark,55.627613,12.369144,Bike Trail
7,kongestien,55.785432,12.467046,Bike Trail


Good! Now we save and load the biketrails dataframe, and visualize the results in a map of the Copenhagen area.

In [64]:
biketrails = pd.read_csv("C:\\Users\\zmcd\\Desktop\\Capstone\\BikeTrails.csv")

In [65]:
cph_map = folium.Map(location=[cph_lat, cph_lon], zoom_start=10)
neighbourhoods = folium.map.FeatureGroup()

# the column labels are swapped in this dataframe, so Venue_Longitude holds the latitudes
for lat, lng in zip(biketrails.Venue_Longitude, biketrails.Venue_Latitude):
    neighbourhoods.add_child(
        folium.features.CircleMarker(
            [lat, lng],
            radius=8, 
            color='blue',
            fill=True,
            fill_color='red',
            fill_opacity=0.8
        )
    )

cph_map.add_child(neighbourhoods)
cph_map

It seems that the bike trails are quite scattered, with 4 out of 7 of them lying outside Copenhagen. Conveniently for us, we are not looking for addresses within a given distance of a trail, but simply for the address closest to one. Now let's convert the lat/lon coordinates of the bike trails to XY/Cartesian coordinates, so that we can calculate their distance from the candidate addresses with the calc_xy_distance function that we defined previously.

In [66]:
xs = []
ys = []

latitudes = biketrails['Venue_Latitude']
longitudes = biketrails['Venue_Longitude']

for i,r in zip(latitudes, longitudes):
    x, y = lonlat_to_xy(r, i)
    xs.append(x)
    ys.append(y)

In [67]:
bt_xy = pd.DataFrame(
    {'Biketrail_XY_Latitude': xs,
     'Biketrail_XY_Longitude': ys,
    })
bt_xy.head()
bt_dict = bt_xy.to_dict('split')
bt_dict = bt_dict['data']
bt_dict[0]

[5295278.883427879, 1816585.2109932085]
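The x value printed above is about 5.3 million, while the bikeshop coordinates computed earlier have x around 348,000 and y around 6,173,000. A jump of that size suggests that latitude and longitude reached `lonlat_to_xy` in swapped order, which is consistent with the bike trail table, where `Venue_Longitude` holds values near 55.6. A cheap guard against this kind of mix-up is a plausibility check before projecting. The sketch below is only an illustration: the helper names are hypothetical, and the bounding box is a rough assumption for the Copenhagen area.

```python
def looks_like_copenhagen(lat, lon):
    """Rough plausibility check (assumed bounds): Copenhagen lies near 55.7 N, 12.6 E."""
    return 55.0 <= lat <= 56.5 and 11.5 <= lon <= 13.5

def order_latlon(a, b):
    """Return (lat, lon) from two values in unknown order; raise if neither order fits."""
    if looks_like_copenhagen(a, b):
        return a, b
    if looks_like_copenhagen(b, a):
        return b, a
    raise ValueError(f'({a}, {b}) is not a plausible Copenhagen coordinate in either order')

# Values for Lille Langebro as they appear in the table, in swapped order
print(order_latlon(12.579872, 55.670966))  # (55.670966, 12.579872)
```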

It seems like we are almost good to go. For the final step, we need to retrieve the XY coordinates of the winning locations. Since we previously stored them as (lon, lat) pairs in the cluster_centers variable, the next step is to convert those coordinates to XY/Cartesian. For this purpose, we will reuse the code that we just applied to get the XY coordinates of the bike trails.

In [68]:
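# NOTE: cluster_centers holds (lon, lat) pairs, so the column labelled 'Lat' receives the longitudes
clust_coord = pd.DataFrame(cluster_centers, columns=['Lat','Lon'])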

In [69]:
win_xs = []
win_ys = []

latitudes = clust_coord['Lat']
longitudes = clust_coord['Lon']

for i,r in zip(latitudes, longitudes):
    x, y = lonlat_to_xy(r, i)
    win_xs.append(x)
    win_ys.append(y)

In [70]:
win_xy = pd.DataFrame(
    {'Winner_XY_Latitude': win_xs,
     'Winner_XY_Longitude': win_ys,
     })
win_xy.head()
win_dict = win_xy.to_dict('split')
win_dict = win_dict['data']
win_dict[0]

[5298027.501045904, 1811463.1687538149]

Nice! Now, all that is left to do is to use the calc_xy_distance formula to calculate the distances of the addresses from each biketrail, selecting the minimum value obtained for each one of the candidate addresses.

In [71]:
bikestore_biketrails_distance = []

for area_x, area_y in zip(win_xs, win_ys):
    min_distance = float('inf')  # no upper bound assumed on the distance to the nearest trail
    for res in bt_dict:
        res_x = res[0]
        res_y = res[1]
        d = calc_xy_distance(area_x, area_y, res_x, res_y)
        if d<min_distance:
            min_distance = d
    bikestore_biketrails_distance.append(min_distance)

In [72]:
biketrails_distances = pd.DataFrame({
                                 'X':win_xs,
                                 'Y':win_ys,
                                 'Biketrails_distance':bikestore_biketrails_distance})

biketrails_distances 

Unnamed: 0,X,Y,Biketrails_distance
0,5298028.0,1811463.0,4732.391484
1,5299611.0,1811346.0,5629.798853
2,5298781.0,1809515.0,3825.844471
3,5298569.0,1813153.0,4754.102547
4,5297511.0,1811528.0,4546.74437
5,5298222.0,1809754.0,3526.758621
6,5298996.0,1810680.0,4724.093919
7,5299371.0,1812179.0,6013.221291
8,5299242.0,1809792.0,4363.360466
9,5297589.0,1812322.0,4848.899847
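As a cross-check that does not depend on the projection, the distance to the nearest trail can also be computed directly from latitude and longitude with the haversine formula. This is a minimal sketch, not the method used above: the function names are hypothetical, the Earth radius is the usual mean value, and the coordinates are taken from the tables in this notebook, reading the value near 55 as latitude. The result will generally differ from the projected distances shown above.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6371000.0):
    """Great-circle distance in metres between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))

def nearest(origin, places):
    """Return (name, distance_m) of the place closest to origin=(lat, lon)."""
    return min(((name, haversine_m(origin[0], origin[1], lat, lon))
                for name, lat, lon in places),
               key=lambda item: item[1])

# Two of the trails listed earlier, as (name, lat, lon)
trails = [('Lille Langebro', 55.670966, 12.579872),
          ('Cykelsupersti C77', 55.66113, 12.516666)]
candidate = (55.682535, 12.529343)  # cluster center reverse geocoded as Nyelandsvej 25B

name, distance = nearest(candidate, trails)
print(name, round(distance), 'm')
```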


Here we have all min distances! But among them, which is the smallest?

In [73]:
minimum_distance = biketrails_distances.Biketrails_distance.min()
minimum_distance

3526.7586212781753

And to which address is it associated?

In [74]:
biketrails_distances.index[biketrails_distances['Biketrails_distance'] == minimum_distance].tolist()

[5]

In [75]:
winner = results.iloc[[5]].join(clust_coord.iloc[[5]])
winner

Unnamed: 0.1,Unnamed: 0,Addresses,Postcode,Lat,Lon
5,5,Nyelandsvej 25B,2000 Frederiksberg,12.529343,55.682535


### Here we have our winner! As a final step, let's visualize our results: the clusters of candidate addresses, and the location of the final winner.

In [76]:
winner_latlon=[55.682535, 12.529343]
winner_address=winner.iloc[0]['Addresses']

In [77]:
cph_map = folium.Map(location=roi_center, zoom_start=14)
folium.Circle(cph_center, radius=50, color='red', fill=True, fill_color='red', fill_opacity=1).add_to(cph_map)
for lonlat, addr in zip(cluster_centers, candidate_area_addresses):
    folium.Marker([lonlat[1], lonlat[0]], popup=addr).add_to(cph_map)    
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.Circle([lat, lon], radius=250, color='#0000ff00', fill=True, fill_color='#0066ff', fill_opacity=0.05).add_to(cph_map)
cph_map

In [78]:
cph_map = folium.Map(location=roi_center, zoom_start=14)
folium.Circle(cph_center, radius=50, color='red', fill=True, fill_color='red', fill_opacity=1).add_to(cph_map)
# 'yellow' is not among the marker colors folium.Icon accepts, so 'orange' is used instead
folium.Marker(winner_latlon, icon=folium.Icon(color='orange', icon='star'), popup=winner_address).add_to(cph_map)
cph_map

# Results and discussion

The analysis shows that although there is a considerable number of bikeshops in Copenhagen (142 within the Copenhagen area), there are pockets of low bikeshop density fairly close to the city center. The highest concentration of bikeshops was detected north-east and south-west of Kongens Nytorv, so attention was focused on the north-western area (as the south-eastern area is the most isolated from tourist and resident traffic), corresponding to the intersection of the boroughs of Frederiksberg and Nørrebro. Interestingly, these two boroughs not only constitute a relatively low-competition zone, but also present attractive features for opening a bikeshop (frequency of bike commutes, general bike traffic and favorable demographics).

After directing our attention to this narrower area of interest (covering approx. 2x2 km north-west of Kongens Nytorv), we first created a dense grid of location candidates (spaced 100 m apart); those locations were then filtered to remove the ones with an established bikeshop within 250 metres.

Those location candidates were then clustered to create zones of interest containing the greatest number of location candidates. The addresses of the centers of those zones were also generated using reverse geocoding, to be used as markers/starting points for more detailed local analysis based on other factors, such as the proximity to a major bike trail.

As a result, 15 zones containing the largest number of potential new bikeshop addresses were identified. This, of course, does not imply that those zones are actually optimal locations for a bikeshop. The purpose of this analysis was only to provide information on areas close to the center of Copenhagen that are not crowded with existing bikeshops. It is entirely possible that there is a very good reason for the small number of bikeshops in any of those areas, one that would make them unsuitable for a new bikeshop regardless of the lack of competition. The recommended zones should therefore be considered only as a starting point for a more detailed analysis, which could eventually result in a location that not only has no nearby competition but also meets the other relevant conditions.

# Limitations

The results produced in the project are subject to limitations in relation to A) methodology, B) data selection, and C) tool selection.

#### A) Methodology
The overall structure of the project is the result of a series of deliberately (and non-deliberately) sub-optimal steps. For instance, there was no explicit need to extract postal codes, nor to use four different APIs. Regarding the former, bikeshop locations could have been extracted through a single broad API request knowing just the lat/lon of Kongens Nytorv. Regarding the latter, the project could have been based on a single API to ensure more consistent venue labeling and geocoding. However, I wanted to experiment with various APIs and different approaches to geocoding to get familiar with a broader and more versatile data science toolbox.

#### B) Data Selection
The data selected for the analysis represent only a small share of the drivers that could have been used to filter locations. For instance, average household income per neighborhood could have been used to further segment areas based on how likely inhabitants are to use a bike as their main means of transport. Similarly, historical data on bike traffic could have been used to define high-traffic areas within Copenhagen, so as to maximize the number of potential customers who would see the new venue. These are just a few of the many approaches that could have been considered for data selection in preparation for the analysis.

#### C) Tools
The reliability of the results depends heavily on the reliability of the tools used to extract data. Specifically, the individual limitations of the APIs used throughout the project are a major factor in how comprehensive the information used for the analysis was. For instance, three APIs were tested for the extraction of bikeshop addresses (Google, Bing and Foursquare), and the one providing the most exhaustive results was selected (namely, the Foursquare REST API). Nevertheless, to a citizen of Copenhagen it is evident that these results are incomplete: for example, a number of bikeshops in the Frederiksberg area were not captured because the Foursquare API did not recognize them as bikeshops (the categoryId was inconsistent). Limitations of this kind can be overcome only by manually integrating data, which was considered out of scope due to time constraints.

# Conclusion

The purpose of this project was to identify areas of Copenhagen close to the center with a low number of bikeshops, in order to aid stakeholders in narrowing down the search for the optimal location for a new bikeshop. By calculating the bikeshop density distribution from Foursquare data, we first identified the boroughs that justify further analysis (Frederiksberg and Nørrebro), and then generated an extensive collection of locations that satisfy some basic requirements regarding existing nearby bikeshops. These locations were then clustered to create major zones of interest (containing the greatest number of potential locations), and the addresses of the cluster centers were generated to be used as starting points for final exploration by stakeholders.

The final decision on the optimal bikeshop location will be made by stakeholders based on the specific characteristics of neighborhoods and locations in every recommended zone, taking into consideration additional factors such as the attractiveness of each location, real estate availability, prices, and the socio-economic dynamics of every neighborhood.